Presistency
Posted on
The paper, Greenshtein and Ritov (2004), is recommended by Larry Wasserman in his post Consistency, Sparsistency and Presistency.
- consistent: $\Vert \hat\beta-\beta\Vert \rightarrow_p 0$
- sparsistent: $P(\supp(\hat\beta)=\supp(\beta))\rightarrow 1$
- presistent: $R(\hat\beta) - R(\beta_n^*)\rightarrow_p 0$.
The term “presistency” indeed attracts me and I take this note for this paper. It is necessary to note that Greenshtein and Ritov (2004) uses the word “persistent”, while Larry Wasserman called it “presistent” which comes from “predictive consistency”.
Abstract
Let $Z^i=(Y^i, X_1^i, \ldots, X_m^i),i=1,\ldots,n$, be iid random vectors, $Z^i\sim F,F\sim\cF$. It is desired to predict $Y$ by $\sum \beta_jX_j$, where $(\beta_1,\ldots,\beta_m)\in B^n\subset\IR^m$ under a prediction loss. Suppose that there are many more explanatory variables than observations, $m = n^\alpha, \alpha>1$. Consider sets $B^n$ restricted by the maximal number of non-zero coefficients of their members, or by their $l_1$ radius.
Study topics
- the asymptotic question: how ‘large’ may be the set $B^n$ be, so that it is still possible to select empirically a predictor whose risk under $F$ is close to that of the best predictor in the set?
- the algorithmic complexity
Main results
- give sharp bounds for orders of magnitudes under various assumptions on $\cF$.
- under various sparsity assumptions on the optimal predictor there is ‘asymptotically no harm’ in introducing many more explanatory variables than observations.
- the ‘lasso’ procedures could be efficient in finding optimal sparse predictors in high dimensions.
Introduction
Standard mathematical statistical paradigm is the existence of a ‘true’ model, and the behaviour of estimators is studied as the number of observations increases, while the model is kept fixed. The paper DOES NOT adopt this paradigm. They consider the problem of predictor selection in a given complex situation, and not that of estimation of a metaphysical unknown parameter.
Denote
\begin{equation} L_F(\beta)=\E_F\Big(Y-\sum_{j=1}^m\beta_jX_j\Big)^2\,. \label{eq:lfbeta} \end{equation}
The set of all possible predictors is too large for estimation. Minimization of the empirical analogue of \eqref{eq:lfbeta} is essentially unrelated to the minimization of \eqref{eq:lfbeta} itself. We will search for natural subsets $B^n\subset \IR^n$, so the task of selecting a (nearly) optimal predictor from $B^n$ is not too ambitious, and can be done empirically. Finally, the procedures that search for a predictor should be feasible in terms of their algorithmic complexity.
Given observations $Z^1,\ldots,Z^n$, denote their empirical distribution by $\hat F$ and let
\[L_{\hat F}(\beta) = \frac 1n\sum_{i=1}^n\Big(Y^i-\sum_{j=1}^m\beta_jX_j^i\Big)^2\,.\]Consider predictor selection methods of the following type. For some $c=c(n)$, choose
\[\hat\beta{}^n = \underset{\beta}{\arg\;\min}L_{\hat F}(\beta) + c(n)\Vert \beta\Vert_1^2\,.\]A related type of method is: for some $b=b(n)$, let
\[\hat\beta{}^n = \underset{\{\beta\mid \Vert \beta\Vert_1\le b(n)\}}{\arg\;\min} L_{\hat F}(\beta)\,.\]Study two type of sets,
- $B_k^n$: at most $k=k(n)$ non-zero entries.
- $B_b^n$: $l_1$ norm less than or equal to $b=b(n)$
Results
- The proper values of $c(n)$ and $b(n)$ are $c(n)=o((\log n/n)^{1/2})$ and $b(n)=o((n/\log n)^{1/4})$, while if $Z^i$ are multivariate normal, we should have $c(n)=o(\log n/n)$ and $b(n)=o((n/\log n)^{1/2})$.
- Show persistence wrt $B_{k(n)}^n$ for $k(n)=o(n/\log n)$, and show that the rate is optimal.
- Yield persistent and algorithmically efficient procedures wrt $B_k^n$ for $k(n)=o(n/\log n)$.
Implications of the rates
Suppose it is known that $\beta_{F_n}^*$, the (nearly) optimal predictor under $F_n$, and suppose the $k$-sparsity rate $k’(n)$(fewer than $k’(n)$ non-zero coefficients) and the $b$-sparsity rate $b’(n)$($\Vert \beta_{F_n}^*\Vert_1\le b’(n)$). Suppose now that there exist persistent procedures wrt sets $B_{k(n)}^n$ (sets $B_{b(n)}^n$), where $k(n)>k’(n)$ (where $b(n)>b’(n)$).
Then there is ‘asymptotically’ no virtue in screening in advance smaller subsets of explanatory variables. This follows since the ‘persistence rates’ $k(n)$ and $b(n)$ imply that by doing so we will not (significantly) improve on procedures that search through the entire set of explanatory variables. (in other word, the other rates are not again ‘persistence’) Yet obviously, when screening a small subset in advance, we may do harm by dropping potentially important variables.
Discussions
The fact that methods which use many more parameters than observations do not yield poor results due to overfitting is something a mystery.