Adaptive Ridge Estimate
The garrotte estimate is based on the OLS estimate $\hat\beta^0$. The standard (ridge) quadratic constraint $\sum_{j=1}^d\beta_j^2\le C$ is replaced by
\[\sum_{j=1}^d \frac{\beta_j^2}{(\hat\beta_j^0)^2} \le C\,.\]Coefficients with smaller OLS estimates are thus penalized more heavily.
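To see the effect of this weighting, here is a minimal numpy sketch with made-up OLS estimates (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical OLS estimates for d = 3 coefficients.
beta_ols = np.array([2.0, 0.5, 0.1])

# Garrotte-style penalty weights 1 / (OLS estimate)^2:
# the smallest OLS coefficient (0.1) gets the largest weight (100),
# so it is shrunk hardest under the constraint sum_j w_j * beta_j^2 <= C.
weights = 1.0 / beta_ols**2
print(weights)  # 0.25, 4.0, 100.0
```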
The paper proposes the adaptive ridge estimate,
\[\hat\beta = \argmin_\beta\sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 + \sum_{j=1}^d \lambda_j\beta_j^2\,.\]In the Bayesian view, each coefficient has its own prior: a centered normal distribution with variance proportional to $1/\lambda_j$.
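For fixed penalties $\lambda_j$ this objective is quadratic in $\beta$, so the minimizer solves $(X^\top X + \operatorname{diag}(\lambda_1,\ldots,\lambda_d))\,\beta = X^\top y$. A minimal numpy sketch with synthetic data and arbitrarily chosen penalties (none of the numbers come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, d = 100, 4                                   # sample size and dimension
X = rng.normal(size=(ell, d))
y = X @ np.array([3.0, -2.0, 0.0, 0.5]) + rng.normal(size=ell)

# Per-coefficient penalties lambda_j, chosen arbitrarily for illustration.
lam = np.array([0.1, 0.1, 10.0, 1.0])

# For fixed lambda_j the objective is quadratic, so the minimizer solves
# (X^T X + diag(lambda)) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X + np.diag(lam), X.T @ y)
print(beta_hat)
```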
The penalties $\lambda_j$ are tied together by the constraint
\[\frac 1d\sum_{j=1}^d\frac{1}{\lambda_j} = \frac{1}{\lambda}, \quad \lambda_j > 0\]Define new variables
\[\gamma_j = \sqrt{\frac{\lambda_j}{\lambda}}\beta_j, \qquad c_j = \sqrt{\frac{\lambda}{\lambda_j}}, \qquad j=1,\ldots,d\,.\]
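As a quick numerical sanity check of this change of variables (a sketch with arbitrary $\beta$ and $\lambda_j$, and $\lambda$ set so that the constraint above holds): the products $c_j\gamma_j$ recover $\beta_j$, the penalty terms match term by term, and the constraint on the $\lambda_j$ turns into $\sum_j c_j^2 = d$.

```python
import numpy as np

d = 3
beta = np.array([1.5, -0.7, 0.2])     # arbitrary coefficients
lam_j = np.array([0.5, 2.0, 5.0])     # arbitrary positive penalties
lam = d / np.sum(1.0 / lam_j)         # enforce (1/d) sum_j 1/lambda_j = 1/lambda

gamma = np.sqrt(lam_j / lam) * beta
c = np.sqrt(lam / lam_j)

# c_j * gamma_j recovers beta_j ...
assert np.allclose(c * gamma, beta)
# ... the penalty terms agree: lambda * gamma_j^2 == lambda_j * beta_j^2 ...
assert np.allclose(lam * gamma**2, lam_j * beta**2)
# ... and the constraint on the lambda_j becomes sum_j c_j^2 = d.
assert np.isclose(np.sum(c**2), d)
```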
Consider the problem
\[\begin{align*} (\hat c, \hat\gamma) &= \argmin_{c, \gamma} \sum_{i=1}^\ell \left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d \gamma_j^2\\ \text{s.t.}\quad & \sum_{j=1}^dc_j^2=d, \quad c_j\ge 0\,. \end{align*} \tag{5}\]The optimality conditions are
\[\newcommand\sgn{\operatorname{sign}} \text{for }k=1,\ldots,d, \quad \begin{cases} \sum_{i=1}^\ell x_{ik}\left(\sum_{j=1}^d\hat\beta_j x_{ij}-y_i\right) + \frac \lambda d\sgn(\hat\beta_k)\sum_{j=1}^d\vert \hat\beta_j\vert = 0\\ \text{or }\hat\beta_k=0\,. \end{cases}\tag{6}\]To derive them, introduce a Lagrange multiplier $\mu$ for the constraint $\sum_{j=1}^d c_j^2=d$ and consider
\[f = \sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)^2 + \lambda\sum_{j=1}^d\gamma_j^2 + \mu\left( \sum_{j=1}^dc_j^2-d\right).\]Setting the derivatives with respect to $\gamma_k$ and $c_k$ to zero gives
\[2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)c_kx_{ik} + 2\lambda\gamma_k=0,\\ 2\sum_{i=1}^\ell\left(\sum_{j=1}^dc_j\gamma_jx_{ij}-y_i\right)\gamma_kx_{ik} + 2\mu c_k=0.\]Denote $e_i = y_i - \sum_{j=1}^dc_j\gamma_jx_{ij}$, then
\[\delta_k \triangleq \sum_{i=1}^\ell e_ix_{ik}, \qquad c_k\delta_k = \lambda \gamma_k, \qquad \gamma_k\delta_k = \mu c_k\,.\]Since $\beta_k=c_k\gamma_k$,
\[c_k = \frac{\gamma_k\delta_k}\mu = \frac{\beta_k\delta_k}{\mu c_k}\]that is
\[c_k^2 =\frac{\beta_k\delta_k}{\mu}\,.\tag{7}\label{eq:ck}\]Summing over $k$ and using $\sum_{k=1}^d c_k^2=d$ gives
\[\mu = \frac{\sum_{j=1}^d\beta_j\delta_j}{d}\]it follows that
\[\delta_k = \frac{\lambda\gamma_k}{c_k} = \frac{\lambda\beta_k}{c_k^2} = \frac{\lambda\beta_k}{\beta_k\delta_k /\mu} = \frac{\lambda\mu}{\delta_k} = \frac{\lambda}{d}\frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k}\]thus
\[\delta_k^2 = \frac \lambda d\sum_{j=1}^d\beta_j\delta_j\,.\]The right-hand side does not depend on $k$, so all the $\delta_k$ share a common magnitude,
\[\delta_k^2 = \delta^2, \quad \delta \ge 0, \quad k=1,\ldots, d\,.\]Moreover $\delta_k = \lambda\mu/\delta_k$ gives $\mu = \delta^2/\lambda \ge 0$, so by $\eqref{eq:ck}$, $\beta_j\delta_j = \mu c_j^2 \ge 0$, that is,
\[\sgn(\beta_j) = \sgn(\delta_j)\,.\]Therefore
\[\begin{align*} \delta_k &= \frac \lambda d \frac{\sum_{j=1}^d\beta_j\delta_j}{\delta_k} \\ &= \frac \lambda d\frac{\sum_{j=1}^d\vert\beta_j\vert\delta \sgn(\beta_j)\sgn(\delta_j)}{\delta \sgn(\delta_k)}\\ &=\frac \lambda d\frac{1}{\sgn(\delta_k)}\sum_{j=1}^d\vert \beta_j\vert\\ &=\frac \lambda d\sgn(\beta_k)\sum_{j=1}^d\vert\beta_j\vert \end{align*}\]
Recalling that $\delta_k = \sum_{i=1}^\ell e_ix_{ik}$, this is exactly condition (6). These optimality conditions are also the first-order conditions of the problem
\[\hat\beta = \argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 +\frac \lambda d\left(\sum_{j=1}^d\vert\beta_j\vert\right)^2.\]This estimate is in turn equivalent to the lasso estimate
\[\argmin_\beta \sum_{i=1}^\ell \left(\sum_{j=1}^d\beta_jx_{ij}-y_i\right)^2 \quad\text{subject to}\quad \sum_{j=1}^d\vert\beta_j\vert \le K\]for a correspondingly chosen bound $K$.
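As a closing sanity check, here is a small numpy experiment (a sketch, not the paper's algorithm): problem (5) is attacked by naively alternating between the ridge solve for $\beta$ with the penalties fixed, and the reallocation of the penalties under the constraint $\frac1d\sum_j 1/\lambda_j = 1/\lambda$, which gives $\lambda_j = \lambda\sum_k\vert\beta_k\vert/(d\vert\beta_j\vert)$. At the fixed point the non-zero coefficients should satisfy condition (6), and $K = \sum_j\vert\hat\beta_j\vert$ is the corresponding lasso constraint level.

```python
import numpy as np

rng = np.random.default_rng(1)
ell, d = 200, 5
X = rng.normal(size=(ell, d))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + 0.5 * rng.normal(size=ell)

lam = 50.0                                                   # overall penalty, chosen arbitrarily
beta = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # plain ridge start

for _ in range(500):
    # Reallocate the per-coefficient penalties for the current beta:
    # lambda_j = lam * sum_k |beta_k| / (d * |beta_j|) satisfies (1/d) sum_j 1/lambda_j = 1/lam.
    w = np.maximum(np.abs(beta), 1e-12)
    lam_j = lam * np.sum(np.abs(beta)) / (d * w)
    # Ridge solve with the updated penalties.
    beta = np.linalg.solve(X.T @ X + np.diag(lam_j), X.T @ y)

# Optimality condition (6): X_k^T (y - X beta) = (lam/d) * sign(beta_k) * sum_j |beta_j|
# for every non-zero coefficient.
delta = X.T @ (y - X @ beta)
rhs = lam / d * np.sign(beta) * np.sum(np.abs(beta))
nz = np.abs(beta) > 1e-6
print(np.c_[delta[nz], rhs[nz]])        # the two columns should (nearly) agree
print("K =", np.sum(np.abs(beta)))      # the matching lasso constraint level
```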