Adaptive Ridge Estimate
The garrotte estimate is based on the OLS estimate $\hat\beta^0$. The standard (ridge) quadratic constraint $\sum_{j=1}^d\beta_j^2\le C$ is replaced by
\[\sum_{j=1}^d \frac{\beta_j^2}{(\hat\beta_j^0)^2} \le C\,.\]Coefficients with smaller OLS estimates are thus penalized more heavily.
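To see the effect of this weighting, here is a minimal numpy sketch with made-up OLS estimates (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical OLS estimates for d = 3 coefficients.
beta_ols = np.array([2.0, 0.5, 0.1])

# Garrotte-style penalty weights 1 / (OLS estimate)^2:
# the smallest OLS coefficient (0.1) gets the largest weight (100),
# so it is shrunk hardest under the constraint sum_j w_j * beta_j^2 <= C.
weights = 1.0 / beta_ols**2
print(weights)  # 0.25, 4.0, 100.0
```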
The paper proposes the adaptive ridge estimate,
\[\hat\beta = \argmin_\beta\sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 + \sum_{j=1}^d \lambda_j\beta_j^2\,.\]In the Bayesian view, each coefficient has its own prior: a centered normal distribution with variance proportional to $1/\lambda_j$.
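For fixed penalties $\lambda_j$ this objective is quadratic in $\beta$, so the minimizer solves $(X^\top X + \operatorname{diag}(\lambda_1,\ldots,\lambda_d))\,\beta = X^\top y$. A minimal numpy sketch with synthetic data and arbitrarily chosen penalties (none of the numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d = 100, 4                                   # sample size and dimension
X = rng.normal(size=(ell, d))
y = X @ np.array([3.0, -2.0, 0.0, 0.5]) + rng.normal(size=ell)

# Per-coefficient penalties lambda_j, chosen arbitrarily for illustration.
lam = np.array([0.1, 0.1, 10.0, 1.0])

# For fixed lambda_j the objective is quadratic, so the minimizer solves
# (X^T X + diag(lambda)) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)
print(beta_hat)
```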
The penalties $\lambda_j$ are tied together by the constraint
\[\frac 1d\sum_{j=1}^d\frac{1}{\lambda_j} = \frac{1}{\lambda}, \quad \lambda_j > 0\]Define new variables
\[\gamma_j = \sqrt{\frac{\lambda_j}{\lambda}}\beta_j, \qquad c_j = \sqrt{\frac{\lambda}{\lambda_j}}, \qquad j=1,\ldots,d\,.\]
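As a quick numerical sanity check of this change of variables (a sketch with arbitrary $\beta$ and $\lambda_j$, and $\lambda$ set so that the constraint above holds): the products $c_j\gamma_j$ recover $\beta_j$, the penalty terms match term by term, and the constraint on the $\lambda_j$ turns into $\sum_j c_j^2 = d$.

```python
import numpy as np

d = 3
beta = np.array([1.5, -0.7, 0.2])     # arbitrary coefficients
lam_j = np.array([0.5, 2.0, 5.0])     # arbitrary positive penalties
lam = d / np.sum(1.0 / lam_j)         # enforce (1/d) sum_j 1/lambda_j = 1/lambda

gamma = np.sqrt(lam_j / lam) * beta
c = np.sqrt(lam / lam_j)

# c_j * gamma_j recovers beta_j ...
assert np.allclose(c * gamma, beta)
# ... the penalty terms agree: lambda * gamma_j^2 == lambda_j * beta_j^2 ...
assert np.allclose(lam * gamma**2, lam_j * beta**2)
# ... and the constraint on the lambda_j becomes sum_j c_j^2 = d.
assert np.isclose(np.sum(c**2), d)
```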
Consider the problem
\[\begin{align*} (\hat c, \hat\gamma) &= \argmin_{c, \gamma} \sum_{i=1}^\ell \left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d \gamma_j^2\\ \text{s.t.}\quad & \sum_{j=1}^dc_j^2=d, \quad c_j\ge 0\,. \end{align*} \tag{5}\]The optimality conditions are
\[\newcommand\sgn{\operatorname{sign}} \text{for }k=1,\ldots,d, \quad \begin{cases} \sum_{i=1}^\ell x_{ik}\left(\sum_{j=1}^d\hat\beta_j x_{ij}-y_i\right) + \frac \lambda d\sgn(\hat\beta_k)\sum_{j=1}^d\vert \hat\beta_j\vert = 0\\ \text{or }\hat\beta_k=0\,. \end{cases}\tag{6}\]To derive them, introduce a Lagrange multiplier $\mu$ for the constraint $\sum_{j=1}^d c_j^2=d$ and consider
\[f = \sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d\gamma_j^2 + \mu\left( \sum_{j=1}^dc_j^2-d\right).\]Setting the derivatives with respect to $\gamma_k$ and $c_k$ to zero gives
\[2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)c_kx_{ik} + 2\lambda\gamma_k=0,\\ 2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)\gamma_kx_{ik} + 2\mu c_k=0.\]Denote $e_i = y_i - \sum_{j=1}^dc_j\gamma_jx_{ij}$, then
\[\delta_k \triangleq \sum_{i=1}^\ell e_ix_{ik}, \qquad c_k\delta_k = \lambda \gamma_k, \qquad \gamma_k\delta_k = \mu c_k\,.\]Since $\beta_k=c_k\gamma_k$,
\[c_k = \frac{\gamma_k\delta_k}\mu = \frac{\beta_k\delta_k}{\mu c_k}\]that is
\[c_k^2 =\frac{\beta_k\delta_k}{\mu}\,.\tag{7}\label{eq:ck}\]Summing over $k$ and using $\sum_{k=1}^d c_k^2=d$ gives
\[\mu = \frac{\sum_{j=1}^d\beta_j\delta_j}{d}\]it follows that
\[\delta_k = \frac{\lambda\gamma_k}{c_k} = \frac{\lambda\beta_k}{c_k^2} = \frac{\lambda\beta_k}{\beta_k\delta_k /\mu} = \frac{\lambda\mu}{\delta_k} = \frac{\lambda}{d}\frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k}\]thus
\[\delta_k^2 = \frac \lambda d\sum_{j=1}^d\beta_j\delta_j\,.\]The right-hand side does not depend on $k$, so all the $\delta_k$ share a common magnitude,
\[\delta_k^2 = \delta^2, \quad \delta \ge 0, \quad k=1,\ldots, d\,.\]Moreover $\delta_k = \lambda\mu/\delta_k$ gives $\mu = \delta^2/\lambda \ge 0$, so by $\eqref{eq:ck}$, $\beta_j\delta_j = \mu c_j^2 \ge 0$, that is,
\[\sgn(\beta_j) = \sgn(\delta_j)\,.\]Therefore
\[\begin{align*} \delta_k &= \frac \lambda d \frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k} \\ &= \frac \lambda d\frac{\sum_{j=1}^d\vert\beta_j\vert\delta \sgn(\beta_j)\sgn(\delta_j)}{\delta \sgn(\delta_k)}\\ &=\frac \lambda d\frac{1}{\sgn(\delta_k)}\sum_{j=1}^d\vert \beta_j\vert\\ &=\frac \lambda d\sgn(\beta_k)\sum_{j=1}^d\vert\beta_j\vert \end{align*}\]
Recalling that $\delta_k = \sum_{i=1}^\ell e_ix_{ik}$, this is exactly condition (6). These optimality conditions are also the first-order conditions of the problem
\[\hat\beta = \argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 +\frac \lambda d\left(\sum_{j=1}^d\vert\beta_j\vert\right)^2.\]This estimate is in turn equivalent to the lasso estimate
\[\argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 \quad\text{subject to}\quad \sum_{j=1}^d\vert\beta_j\vert \le K\]for a correspondingly chosen bound $K$.
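As a closing sanity check, here is a small numpy experiment (a sketch, not the paper's algorithm): problem (5) is attacked by naively alternating between the ridge solve for $\beta$ with the penalties fixed, and the reallocation of the penalties under the constraint $\frac1d\sum_j 1/\lambda_j = 1/\lambda$, which gives $\lambda_j = \lambda\sum_k\vert\beta_k\vert/(d\vert\beta_j\vert)$. At the fixed point the non-zero coefficients should satisfy condition (6), and $K = \sum_j\vert\hat\beta_j\vert$ is the corresponding lasso constraint level.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, d = 200, 5
X = rng.normal(size=(ell, d))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + 0.5 * rng.normal(size=ell)

lam = 50.0                                                   # overall penalty, chosen arbitrarily
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # plain ridge start

for _ in range(500):
    # Reallocate the per-coefficient penalties for the current beta:
    # lambda_j = lam * sum_k |beta_k| / (d * |beta_j|) satisfies (1/d) sum_j 1/lambda_j = 1/lam.
    w = np.maximum(np.abs(beta), 1e-12)
    lam_j = lam * np.sum(np.abs(beta)) / (d * w)
    # Ridge solve with the updated penalties.
    beta = np.linalg.solve(X.T @ X + np.diag(lam_j), X.T @ y)

# Optimality condition (6): X_k^T (y - X beta) = (lam/d) * sign(beta_k) * sum_j |beta_j|
# for every non-zero coefficient.
delta = X.T @ (y - X @ beta)
rhs = lam / d * np.sign(beta) * np.sum(np.abs(beta))
nz = np.abs(beta) > 1e-6
print(np.c_[delta[nz], rhs[nz]])        # the two columns should (nearly) agree
print("K =", np.sum(np.abs(beta)))      # the matching lasso constraint level
```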