# Optimality for Sparse Group Lasso

##### Posted on

This note is based on Cai, T. T., Zhang, A., & Zhou, Y. (2019). Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference. ArXiv:1909.09851 [Cs, Math, Stat].

sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse.

- in the noiseless case, provide matching upper and lower bounds on sample complexity for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively.
- in the noisy case, develop upper and matching minimax lower bounds for estimation error.

also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference.

## Introduction

The high-dimensional double sparse regression with simultaneously group-wise and element-wise sparsity structures

\[y=X\beta^* + \varepsilon\,,\]where the covariates $X\in \IR^{n\times p}$ and $\beta^*$ are divided into $d$ known groups, where the $i$-th group contains $b_i$ variables,

\[X = [X_{(1)}, \ldots, X_{(d)}],\quad \beta^*=( (\beta^*_{(1)})^T,\ldots, (\beta^*_{(d)})^T )^T,\quad X_{(j)}\in \IR^{n\times b_i},\beta_{(j)}^*\in \IR^{b_i};\]$\beta^*$ is a $(s, s_g)$-sparse vector in the sense that

\[\Vert \beta^*\Vert_{0,2}:=\sum_{j=1}^d1_{\beta^*_{(j)}\neq 0}\le s_g\quad \text{and}\quad \Vert \beta^*\Vert_0:=\sum_{i=1}^p 1_{\beta_i^*\neq 0}\le s\,.\]The focus of the paper: estimation and inference for $\beta^*$ based on $(y, X)$.

Applications:

- GWAS, the genes can be grouped into pathways and it is believed that only a small portion of the pathways contain causal SNPs, and the number of causal SNPs is much less than the one of non-causal SNPs in a causal pathway.
- cancer diagnosis and therapy

The sparse group Lasso provides a classic and straightforward estimator for $\beta^*$:

\[\hat\beta = \argmin_\beta\Vert y-X\beta\Vert^2_2 + \lambda\Vert \beta\Vert_1 + \lambda_g \Vert \beta\Vert_{1,2}\,,\]where $\Vert \beta\Vert_{1,2}=\sum_j\Vert \beta_{(j)}\Vert_2$.

In the noiseless setting that $\varepsilon=0$, it turns out to be

\[\hat \beta = \argmin \lambda \Vert \beta\Vert_1 + \lambda_g \Vert \beta\Vert_{1,2}\qquad \text{s.t. } y=X\beta\,.\]To estimate the simultaneously element-wise and group-wise sparse vector $\beta^*$, despite many empirical success of sparse group Lasso in practice, the theoretical properties, including optimal rate of convergence and sample complexity, are still unclear so far to the authors’ knowledge.

### Simultaneously Structured Models

the parameter of interest has multiple structures at the same time

- high-dimensional double sparse regression
- sparse principal component analysis
- tensor singular value decomposition
- simultaneously sparse and low-rank matrix/tensor recovery
- sparse matrix/tensor SVD
- sparse phase retrieval

The seminal work of Oymak et al. by minimizing multi-objective regularizers with norms associated with these structures (such as $\ell_1$ norm for element-wise sparsity, nuclear norm for low-rankness, and total variation norm for piecewise constant structure.), one usually cannot do better than applying an algorithm that only exploits one structure. They particularly illustrated that simultaneously sparse and low-rank structured matrix cannot be well estimated by penalizing $\ell_1$ and nuclear norm regularizers. However, based on their results, it remains an open question whether the convex regularization, such as sparse group Lasso or $\ell_1+\ell_{1,2}$ minimization, can achieve good performance in estimation of parameter with two types of sparsity structures.

### Optimality and Related Literature

The paper claims that, it fills the void of statistical limits of sparse group Lasso and provides an affirmative answer to the aforementioned questions: by exploiting both element-wise and group-wise sparsity structures, the $\ell_1+\ell_{1,2}$ regularization does provide better performance in high-dimensional double sparse regression.

- in the noiseless case,
- it is shown that $(s,s_g)$-sparse vectors can be exactly recovered and approximately $(s,s_g)$-sparse vectors can be stably estimated with high probability whenever the sample size satisfies $n…$
- the exact recovery cannot be achieved by $\ell_1+\ell_{1,2}$ regularization and stable estimation of approximately $(s, s_g)$-sparse vectors is impossible in general unless $n…$

- in the noisy case, develop the matching upper and lower bounds on the convergence rate for the estimation error.
- study the statistical inference for the individual coordinates of $\beta^*$

Non-trivial theoretical analysis:

- the regularizer is not decomposable w.r.t. the support $\beta^*$ so that the classic techniques of decomposable regularizers and null space property.

Statistical properties of sparse lasso and related estimators:

- developed consistency for estimators with a general tree-structured norm regularizers.
- the asymptotically behaviors of the adaptive sparse group lasso estimator
- the multitask learning and classification problem based on a variant of sparse group Lasso estimator
- multivariate linear regression via sparse Lasso
- theoretical framework for developing error bounds of the group Lasso, sparse group Lasso, and group Lasso with tree structured overlapping groups.

The authors claim that, it is the first paper that provides optimal theoretical guarantees for both the sample complexity and estimation error of sparse group Lasso.