# WeiYa's Work Yard

##### Posted on Mar 30, 2022
Tags: Ridge, Lasso

The garrotte estimate is based on the OLS estimate $\hat\beta^0$. The standard quadratic constraint is replaced by

$\sum_{j=1}^d \beta_j^2 / (\hat\beta_j^0)^2 \le C$

Coefficients with smaller OLS estimates are thus more heavily penalized.
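In Lagrangian form this constrained problem is a weighted ridge regression with a closed-form solution. A minimal numpy sketch (the data, `lam`, and the function name `garrote_ridge` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 0.5, -1.0]) + 0.1 * rng.normal(size=50)

# OLS estimate beta^0
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

def garrote_ridge(X, y, beta_ols, lam):
    """Minimize ||X beta - y||^2 + lam * sum_j beta_j^2 / (beta_ols_j)^2,
    the Lagrangian form of the garrotte-style quadratic constraint."""
    D = np.diag(1.0 / beta_ols**2)  # heavier penalty on small OLS coefficients
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

beta_hat = garrote_ridge(X, y, beta_ols, lam=1.0)
```

With `lam = 0` this reduces to OLS; as `lam` grows, the coordinates with small $\hat\beta_j^0$ are shrunk the fastest.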

The paper proposes a compromise between the two, in which each coefficient gets its own penalty parameter $\lambda_j$,

$\hat\beta = \argmin\sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 + \sum_{j=1}^d \lambda_j\beta_j^2$

In Bayesian terms, each coefficient has its own prior: a centered normal distribution with variance proportional to $1/\lambda_j$. To fix the overall amount of regularization, the $\lambda_j$ are constrained so that

$\frac 1d\sum_{j=1}^d\frac{1}{\lambda_j} = \frac{1}{\lambda}, \quad \lambda_j > 0$

Define new variables

$\gamma_j = \sqrt{\frac{\lambda_j}{\lambda}}\beta_j, \qquad c_j = \sqrt{\frac{\lambda}{\lambda_j}}, \qquad j=1,\ldots,d$

so that $\beta_j = c_j\gamma_j$ and the constraint on the $\lambda_j$ becomes $\sum_{j=1}^d c_j^2 = d$.
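A quick numerical check of this reparametrization (the sampled $\lambda_j$ and $\beta_j$ below are arbitrary illustrative values): the penalty $\sum_j\lambda_j\beta_j^2$ equals $\lambda\sum_j\gamma_j^2$, the product $c_j\gamma_j$ recovers $\beta_j$, and the constraint on the $\lambda_j$ turns into $\sum_j c_j^2 = d$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 4, 2.0

# draw positive 1/lambda_j and rescale so that (1/d) * sum(1/lambda_j) = 1/lam
inv = rng.uniform(0.5, 2.0, size=d)
inv = inv / inv.mean() / lam          # now mean(inv) == 1/lam
lam_j = 1.0 / inv

beta = rng.normal(size=d)
gamma = np.sqrt(lam_j / lam) * beta
c = np.sqrt(lam / lam_j)
```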

Consider the problem

\begin{align*} (\hat c, \hat\gamma) &= \argmin_{c, \gamma} \sum_{i=1}^\ell \left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d \gamma_j^2\\ \text{s.t.}\quad & \sum_{j=1}^dc_j^2=d, \quad c_j\ge 0 \end{align*} \tag{5}

The optimality conditions are

$\text{for }k=1,\ldots,d, \begin{cases} \sum_{i=1}^\ell x_{ik}\left(\sum_{j=1}^d\hat\beta_j x_{ij}-y_i\right) + \frac \lambda d\operatorname{sign}(\hat\beta_k)\sum_{j=1}^d\vert \hat\beta_j\vert = 0\\ \text{or }\hat\beta_k=0 \end{cases}\tag{6}$

To derive these, consider the Lagrangian (the nonnegativity constraints $c_j\ge 0$ are dropped here, as they turn out to be satisfied automatically)

$\newcommand\sgn{\mathrm{sign}} f = \sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d\gamma_j^2 + \mu\left( \sum_{j=1}^dc_j^2-d\right)$

Setting the derivatives with respect to $\gamma_k$ and $c_k$ to zero gives

$2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)c_kx_{ik} + 2\lambda\gamma_k=0\\ 2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)\gamma_kx_{ik} + 2\mu c_k=0\\$
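These derivative formulas can be sanity-checked against finite differences of $f$ at an arbitrary point (all the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, lam, mu = 20, 3, 1.5, 0.7
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
c = rng.uniform(0.5, 1.5, size=d)
gamma = rng.normal(size=d)

def f(c, gamma):
    # Lagrangian: squared loss + ridge penalty on gamma + multiplier term on c
    r = X @ (c * gamma) - y
    return r @ r + lam * gamma @ gamma + mu * (c @ c - d)

# analytic derivatives from the two displayed equations
r = X @ (c * gamma) - y
grad_gamma = 2 * (X.T @ r) * c + 2 * lam * gamma
grad_c = 2 * (X.T @ r) * gamma + 2 * mu * c
```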

Denote $e_i = y_i - \sum_{j=1}^dc_j\gamma_jx_{ij}$ and $\delta_k \triangleq \sum_{i=1}^\ell e_ix_{ik}$; then the two conditions become

$c_k\delta_k = \lambda \gamma_k, \qquad \gamma_k\delta_k = \mu c_k$

Note that $\beta_k=c_k\gamma_k$, then

$c_k = \frac{\gamma_k\delta_k}\mu = \frac{\beta_k\delta_k}{\mu c_k}$

that is

$$c_k^2 =\frac{\beta_k\delta_k}{\mu}\label{eq:ck}$$

Since $\sum_{k=1}^d c_k^2=d$, it follows that

$\mu = \frac{\sum_{j=1}^d\beta_j\delta_j}{d}$

it follows that

$\delta_k = \frac{\lambda\gamma_k}{c_k} = \frac{\lambda\beta_k}{c_k^2} = \frac{\lambda\beta_k}{\beta_k\delta_k /\mu} = \frac{\lambda\mu}{\delta_k} = \frac{\lambda}{d}\frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k}$

thus

$\delta_k^2 = \frac \lambda d\sum_{j=1}^d\beta_j\delta_j\,.$

Note that the right-hand side does not depend on $k$, so all the $\delta_k$ share a common magnitude:

$\delta_k^2 \triangleq\delta^2, \quad \delta \ge 0, \quad k=1,\ldots, d\,.$

By $\eqref{eq:ck}$, $c_j^2 = \beta_j\delta_j/\mu \ge 0$; moreover $\mu = \delta^2/\lambda > 0$ (assuming $\delta > 0$), so $\beta_j\delta_j \ge 0$ for every $j$, that is,

$\sgn(\beta_j) = \sgn(\delta_j)$

then

\begin{align*} \delta_k &= \frac \lambda d \frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k} \\ &= \frac \lambda d\frac{\sum_{j=1}^d\vert\beta_j\vert\delta \sgn(\beta_j)\sgn(\delta_j)}{\delta \sgn(\delta_k)}\\ &=\frac \lambda d\frac{1}{\sgn(\delta_k)}\sum_{j=1}^d\vert \beta_j\vert\\ &=\frac \lambda d\sgn(\beta_k)\sum_{j=1}^d\vert\beta_j\vert \end{align*}

These are exactly the optimality conditions (6); they are also the first-order conditions of the problem

$\hat\beta = \argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 +\frac \lambda d\left(\sum_{j=1}^d\vert\beta_j\vert\right)^2$

Since the penalty is an increasing function of $\sum_{j=1}^d\vert\beta_j\vert$, this estimate is equivalent, for some $K$ determined by $\lambda$, to the lasso estimate

$\argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 \quad\text{subject to}\quad \sum_{j=1}^d\vert\beta_j\vert \le K\,.$
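The derivation also suggests a simple numerical check: iterate a reweighted ridge in which each $\lambda_j$ is set from the current $\beta$ as $\lambda_j = \lambda\sum_l\vert\beta_l\vert/(d\vert\beta_j\vert)$ (the reweighting implied by $c_k^2 = \beta_k\delta_k/\mu$ at the fixed point), then verify that the limit satisfies the sign conditions (6). This is only a sketch, not necessarily the paper's algorithm; the data and the guard `1e-10` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 100, 5, 5.0
X = rng.normal(size=(n, d))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.1 * rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from OLS
for _ in range(1000):
    beta_old = beta
    # reweighting implied by the derivation; guard avoids division by zero
    lam_j = lam * np.abs(beta).sum() / (d * np.maximum(np.abs(beta), 1e-10))
    beta = np.linalg.solve(X.T @ X + np.diag(lam_j), X.T @ y)
    if np.max(np.abs(beta - beta_old)) < 1e-12:
        break
```

At convergence, the nonzero coordinates satisfy $\sum_i x_{ik}(\sum_j\beta_jx_{ij}-y_i) + \frac\lambda d\,\mathrm{sign}(\beta_k)\sum_j\vert\beta_j\vert \approx 0$, i.e. condition (6).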

Published in categories Note