
Dantzig Selector

Tags: Linear Programming, Linear Model, Dantzig

This post is based on Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$. The Annals of Statistics, 35(6), 2313–2351.

Consider $y=X\beta+z$, where $\beta\in\IR^p$ is a parameter vector of interest, $X$ is an $n\times p$ data matrix with possibly far fewer rows than columns, $n\ll p$, and the $z_i$ are i.i.d. $N(0,\sigma^2)$. The Dantzig selector solves

\[\min_{\tilde \beta\in \IR^p}\Vert \tilde \beta\Vert_{\ell_1}\quad\text{subject to }\Vert X^*r\Vert_{\ell_\infty}\le (1+t^{-1})\sqrt{2\log p}\sigma\,,\]

where $r$ is the residual vector $y-X\tilde\beta$ and $t$ is a positive scalar. The paper shows that if $X$ obeys a uniform uncertainty principle (with unit-normed columns) and if the true parameter vector $\beta$ is sufficiently sparse (which here roughly guarantees that the model is identifiable), then with very large probability,

\[\Vert \hat\beta-\beta\Vert_{\ell_2}^2\le C^2\cdot 2\log p\cdot \left(\sigma^2+\sum_i\min(\beta_i^2,\sigma^2)\right)\,.\]
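For instance, if $\beta$ has $S$ nonzero entries, each exceeding $\sigma$ in magnitude, then $\sum_i\min(\beta_i^2,\sigma^2)=S\sigma^2$ and the bound reads

\[\Vert \hat\beta-\beta\Vert_{\ell_2}^2\le C^2\cdot 2\log p\cdot (S+1)\sigma^2\,,\]

which is within a factor proportional to $\log p$ of the ideal risk $S\sigma^2$ that one could roughly achieve with an oracle revealing the support of $\beta$.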

The DS program can be written as

\[\min_{\tilde \beta\in\IR^p}\Vert \tilde \beta\Vert_{\ell_1}\quad \text{subject to }\Vert X^*r\Vert_{\ell_\infty}:=\sup_{1\le i\le p}\vert (X^*r)_i\vert \le \lambda_p \sigma\tag{DS}\]

for some $\lambda_p > 0$.

For the name, the authors wrote, “with this name, we intend to pay tribute to the father of linear programming who passed away while we were finalizing this manuscript, and to underscore that the convex program (DS) is effectively a variable selection technique.”

Note that the DS program is convex, and can easily be recast as a linear program (LP),

\[\min\sum_i u_i\quad\text{subject to }-u\le \tilde \beta\le u \text{ and }-\lambda_p\sigma\1\le X^*(y-X\tilde \beta)\le \lambda_p\sigma\1\,.\]

Although the reformulation enlarges the variable space by introducing the auxiliary vector $u$, it has no effect on the original solution. As a toy example, suppose we want to solve $\min \vert x\vert+\vert y\vert$, which can be converted to $\min u+v\text{ s.t. }-u\le x\le u, -v\le y\le v$. At the optimum the constraints must be tight, $u=\vert x\vert$ and $v=\vert y\vert$, since otherwise $u+v$ could be decreased further; hence the two problems share the same optimal value and the same optimal $(x,y)$, and they are equivalent. A small numerical check of the recast program is sketched below.
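As a sanity check of this reformulation, here is a minimal sketch (not the authors' l1-magic code) that feeds the LP to `scipy.optimize.linprog`; the helper name `dantzig_selector_lp` and the choice $\delta=\sqrt{2\log p}\,\sigma$ are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector_lp(X, y, delta):
    """Solve min ||b||_1 s.t. ||X^T(y - X b)||_inf <= delta as an LP.

    Variables are stacked as x = [beta; u] with constraints
    -u <= beta <= u and -delta*1 <= X^T(y - X beta) <= delta*1.
    """
    n, p = X.shape
    U = X.T @ X          # p x p Gram matrix
    ytil = X.T @ y       # correlations X^T y

    c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum(u)

    I, Z = np.eye(p), np.zeros((p, p))
    A_ub = np.block([
        [ I, -I],   #  beta - u <= 0
        [-I, -I],   # -beta - u <= 0
        [ U,  Z],   #  U beta <= delta + ytil
        [-U,  Z],   # -U beta <= delta - ytil
    ])
    b_ub = np.concatenate([np.zeros(2 * p), delta + ytil, delta - ytil])

    # bounds=(None, None): beta may be negative (linprog defaults to x >= 0)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    return res.x[:p]

# toy usage: sparse beta, n < p
rng = np.random.default_rng(0)
n, p, sigma = 50, 100, 0.5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)          # unit-normed columns
beta = np.zeros(p)
beta[:5] = 3.0
y = X @ beta + sigma * rng.standard_normal(n)
beta_hat = dantzig_selector_lp(X, y, np.sqrt(2 * np.log(p)) * sigma)
```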

For a general linear program with inequality constraints

\[\min c^*\beta\quad\text{subject to }F\beta\le b\,,\]

define

  • $f(\beta) = F\beta-b$
  • $r_{dual} = c+F^*\lambda$
  • $r_{cent} = -\diag(\lambda)f(\beta)-\1/t$

There might be a typo in the original paper, in which $r_{dual} = c+X^*\lambda$; it should be $c+F^*\lambda$.

The primal-dual method (refer to the next post for more details) updates the current $(\beta,\lambda)$ by means of a Newton step, and solves

\[\begin{bmatrix} 0 & F^*\\ -\diag(\lambda)F & -\diag(f(\beta)) \end{bmatrix} \begin{bmatrix} \Delta \beta\\ \Delta \lambda \end{bmatrix} = - \begin{bmatrix} r_{dual}\\ r_{cent} \end{bmatrix}\,,\]

that is,

\[\begin{align*} \Delta \lambda &= -\diag(1/f(\beta))(\diag(\lambda)F\Delta \beta-r_{cent})\\ -[F^*\diag(\lambda/f(\beta))F]\Delta\beta &= -(r_{dual} + F^*\diag(1/f(\beta))r_{cent}) \end{align*}\,.\]

The current guess is then updated via $(\beta^+,\lambda^+)=(\beta,\lambda)+s(\Delta\beta,\Delta\lambda)$, where $s>0$ is a step size.
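Here is a minimal NumPy sketch of one such Newton step for the generic program $\min c^*\beta$ subject to $F\beta\le b$, following the formulas above; the step size $s$ is taken as given (a real solver would pick it by a line search that keeps $f(\beta)<0$ and $\lambda>0$), and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def primal_dual_step(beta, lam, c, F, b, t, s):
    """One Newton step of the primal-dual method for
    min c^T beta  subject to  F beta <= b,
    assuming (beta, lam) is strictly feasible: f(beta) < 0, lam > 0."""
    f = F @ beta - b                      # slack vector f(beta)
    r_dual = c + F.T @ lam
    r_cent = -lam * f - 1.0 / t

    # eliminate dlam: [F^T diag(lam/f) F] dbeta = r_dual + F^T diag(1/f) r_cent
    H = F.T @ ((lam / f)[:, None] * F)
    rhs = r_dual + F.T @ (r_cent / f)
    dbeta = np.linalg.solve(H, rhs)
    dlam = -(lam * (F @ dbeta) - r_cent) / f

    return beta + s * dbeta, lam + s * dlam
```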

Letting $U=X^*X$ and $\tilde y=X^*y$, the problem parameters have the block structure

\[F = \begin{bmatrix} I & -I\\ -I & -I\\ U & 0\\ -U & 0 \end{bmatrix},\quad b = \begin{bmatrix} 0\\ 0\\ \delta + \tilde y\\ \delta - \tilde y \end{bmatrix},\quad c = \begin{bmatrix} 0\\ 1 \end{bmatrix}\,,\]

which gives

\[F^*\diag(\lambda/f)F = \begin{bmatrix} D_1+D_2+U^*(D_3+D_4)U & D_2-D_1\\ D_2-D_1 & D_1+D_2 \end{bmatrix}\]

where $D_i=\diag(\lambda_i/f_i), 1\le i\le 4$, and

\[\begin{bmatrix} f_1\\ f_2\\ f_3\\ f_4 \end{bmatrix} = \begin{bmatrix} \tilde \beta - u\\ -\tilde \beta - u\\ U\tilde\beta -\delta -\tilde y\\ -U\tilde\beta -\delta + \tilde y \end{bmatrix}\,.\]

Writing $\begin{bmatrix}r_1\\ r_2\end{bmatrix} = r_{dual} + F^*\diag(1/f(\beta))r_{cent}$, it is convenient to solve the system

\[F^*\diag(\lambda/f)F\begin{bmatrix} \Delta \tilde\beta\\ \Delta u \end{bmatrix} = \begin{bmatrix} r_1\\ r_2 \end{bmatrix}\]

by elimination and obtain

\[\begin{align*} [4(D_1+D_2)^{-1}D_1D_2+U^*(D_3+D_4)U]\Delta\tilde\beta &= r_1 - (D_1+D_2)^{-1}(D_2-D_1)r_2\\ (D_1+D_2)\Delta u &= r_2 - (D_2-D_1)\Delta\tilde\beta\,. \end{align*}\]
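For concreteness, here is a rough NumPy sketch of how the reduced system for the DS block structure might be solved, with each $D_i$ stored as a vector of diagonal entries; the names (`lams`, `ytil`, `delta`, `r1`, `r2`) are my own and this is a sketch rather than the authors' l1-magic implementation.

```python
import numpy as np

def ds_newton_direction(beta, u, lams, U, ytil, delta, r1, r2):
    """Solve the eliminated system for (dbeta, du) in the Dantzig selector LP.

    lams = (lam1, lam2, lam3, lam4) are the dual variables attached to the
    four constraint blocks of F; U = X^* X, ytil = X^* y, delta = lambda_p*sigma.
    """
    lam1, lam2, lam3, lam4 = lams
    # slack blocks f_i = (F [beta; u] - b)_i, all negative at an interior point
    f1 = beta - u
    f2 = -beta - u
    f3 = U @ beta - delta - ytil
    f4 = -U @ beta - delta + ytil

    d1, d2, d3, d4 = lam1 / f1, lam2 / f2, lam3 / f3, lam4 / f4  # diag(D_i)

    # [4(D1+D2)^{-1} D1 D2 + U^*(D3+D4)U] dbeta = r1 - (D1+D2)^{-1}(D2-D1) r2
    A = np.diag(4.0 * d1 * d2 / (d1 + d2)) + U.T @ ((d3 + d4)[:, None] * U)
    rhs = r1 - (d2 - d1) / (d1 + d2) * r2
    dbeta = np.linalg.solve(A, rhs)
    du = (r2 - (d2 - d1) * dbeta) / (d1 + d2)
    return dbeta, du
```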

In other words, each step may involve solving a $p$ by $p$ system of linear equations. In fact, when $n$ is less than $p$, it is possible to solve a smaller system thanks to the Sherman-Morrison-Woodbury formula. Write $U^*(D_3+D_4)U=X^*B$, where $B=X(D_3+D_4)X^*X$, and put $D_{12}=4(D_1+D_2)^{-1}D_1D_2$. Then

\[(D_{12}+X^*B)^{-1}=D_{12}^{-1} -D_{12}^{-1}X^*(I+BD_{12}^{-1}X^*)^{-1}BD_{12}^{-1}\,.\]

The advantage is that one needs to solve the smaller $n\times n$ system $(I+BD_{12}^{-1}X^*)\beta'=b'$.
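Under the same assumed names, a short sketch of this Woodbury trick (again not the authors' code), where only an $n\times n$ system is factored:

```python
import numpy as np

def smw_solve(X, d12, d34, rhs):
    """Solve (D12 + U^*(D3+D4)U) dbeta = rhs with U = X^* X, where
    d12 = diag(4(D1+D2)^{-1} D1 D2) and d34 = diag(D3+D4), by applying
    the Woodbury identity so that only an n x n system is solved."""
    n, p = X.shape
    B = X @ (d34[:, None] * (X.T @ X))                 # B = X (D3+D4) X^* X, n x p
    z = rhs / d12                                      # D12^{-1} rhs
    S = np.eye(n) + B @ ((1.0 / d12)[:, None] * X.T)   # I + B D12^{-1} X^*
    w = np.linalg.solve(S, B @ z)                      # the small n x n solve
    return z - (X.T @ w) / d12                         # D12^{-1}(rhs - X^* w)
```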

The source code can be found on the author's homepage; the URL mentioned in the paper is no longer valid.


Published in categories Memo