
PLS in High-Dimensional Regression

Tags: Partial Least Squares, High-Dimensional, Dimension Reduction

This note is based on Cook, R. D., & Forzani, L. (2019). Partial least squares prediction in high-dimensional regression. The Annals of Statistics, 47(2), 884–908.

Abstract

goal

The asymptotic behavior of predictions from PLS regression as the sample size and number of predictors diverge in various alignments.

results

  1. there is a range of regression scenarios where PLS predictions have the usual $\sqrt n$ convergence rate, even when the sample size is substantially smaller than the number of predictors.
  2. an even wider range where the rate is slower but may still produce practically useful results.
  3. PLS predictions achieve their best asymptotic behavior in abundant regressions where many predictors contribute information about the response.
  4. Their asymptotic behavior tends to be undesirable in sparse regressions where few predictors contribute information about the response.

Introduction

Literatures

  1. Helland: the first to define a PLS regression model, and a first attempt at MLE
  2. Frank and Friedman: an informative discussion of PLS regression from various statistical views
  3. Garthwaite: a statistical interpretation of the PLS algorithm
  4. Naik and Tsai: PLS regression provides a consistent estimator of the central subspace when the distribution of the response given the predictors follows a single-index model
  5. Delaigle and Hall: extended it to functional data
  6. Cook, Helland and Su: established a population connection between PLS regression and envelopes in the context of multivariate linear regression, provided the first firm PLS model, and showed that envelope estimation leads to $\sqrt n$-consistent estimators in traditional fixed-$p$ contexts.

Applications

  1. Chemometrics and Statistics communities
  2. the analysis of high-dimensional genomic data
  3. microarray-based classification
  4. analysis of data from PET and fMRI studies
  5. spatiotemporal data
  6. image analysis

Statistical properties

The algorithmic nature of PLS evidently made it difficult to study using traditional statistical measures, with the consequence that PLS was long regarded as a technique that is useful but whose core statistical properties are elusive.

  1. Chun and Keles: within a certain modeling framework, the PLS estimator is inconsistent unless $p/n\rightarrow 0$
  2. Cook and Forzani: study single-component PLS, and found that in some reasonable settings, PLS can converge at $\sqrt n$ rate as $n,p\rightarrow \infty$, regardless of the alignment between $n$ and $p$.

In this paper,

  • follow the general setup of Cook and Forzani, and use traditional $(n,p)$-asymptotic arguments to provide insights into PLS predictions in multiple-component regressions.
  • give bounds on the rates of convergence for PLS predictions as $n,p\rightarrow\infty$ and in doing so the authors conjecture about the value of PLS in various regression scenarios.

PLS review

Consider the typical linear regression model with univariate response $y$ and random predictor vector $X\in\bbR^p$,

\[y=\mu + \beta^T(X-\E X)+\epsilon\,,\]

where $\epsilon\sim (0,\tau^2)$. Assume $(y,X)$ follows a nonsingular multivariate normal distribution and the data $(y_i,X_i),i=1,\ldots,n$ arise as independent copies of $(y,X)$. Let $Y=(y_1,\ldots,y_n)^T$ and let $F$ denote the $p\times n$ matrix with columns $(X_i-\bar X),i=1,\ldots,n$. Then

\[Y=\alpha 1_n+F^T\beta + \varepsilon\,,\]

where $\alpha=\E(y)$ and $\varepsilon=(\epsilon_i)$.

Notations:

  • $\Sigma=\var(X)>0$
  • $\sigma=\cov(X,y)$.
  • $W_q(\Omega)$: the Wishart distribution with $q$ degrees of freedom and scale matrix $\Omega$.
  • $P_{A(\Delta)}$: the projection in the $\Delta>0$ inner product onto span($A$) if $A$ is a matrix or onto $A$ itself if it is a subspace.
  • $P_A:=P_{A(I)}$: the projection in the usual inner product and $Q_A=I-P_A$.
  • $\hat\sigma = n^{-1}FY$: the usual moment estimator of $\sigma$
  • $\hat\Sigma=n^{-1}FF^T$: the usual moment estimator of $\Sigma$
  • $W=FF^T\sim W_{n-1}(\Sigma)$, then $\hat\Sigma=W/n, \hat\sigma=n^{-1}(W\beta+F\varepsilon)$
  • $B=\hat\Sigma^{-1}\hat\sigma$: OLS estimator of $\beta$.

Identify a dimension reduction subspace $\cH\subseteq \bbR^p$ so that $y\ind X\mid P_\cH X$ and $d:=\text{dim}(\cH)< p$. Assume a basis matrix $H\in \bbR^{p\times d}$ of $\cH$ is known and that $\hat\Sigma>0$. Following the reduction $X\mapsto H^TX$, apply OLS to estimate the coefficient vector $\beta_{y\mid H^TX}$,

\[\tilde\beta_{y\mid H^TX} = (H^T\hat\Sigma H)^{-1}H^T\hat\sigma\,.\]

The known-$H$ estimator $\tilde \beta_H$ of $\beta$ is then

\[\tilde \beta_H = H\tilde \beta_{y\mid H^TX} = P_{\cH(\hat\Sigma)}B\]

That is, the estimated coefficient vector for the reduced predictors $H^TX$ is mapped back to $\bbR^p$ through $H$, which amounts to projecting the OLS estimator $B$ onto $\cH$ in the $\hat\Sigma$ inner product.

Here

\[P_{\cH(\hat\Sigma)}= H(H^T\hat\Sigma H)^{-1} H^T\]

which implies that

  • $\tilde \beta_H$ is a projection of $B$ onto $\cH$
  • $\tilde \beta_H$ depends on $H$ only via $\cH$
  • $\tilde \beta_H$ requires $H^T\hat\Sigma H>0$, but doesn't actually require $\hat\Sigma>0$. This is essentially how PLS handles $n < p$ regressions: by reducing the predictors to $H^TX$ while requiring $n\gg d$ (see the sketch after this list).
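Here is a minimal numpy sketch of this known-$H$ estimator under a simulated $n<p$ design (the names and data-generating choices are illustrative, not from the paper): reduce the predictors to $H^TX$, run OLS on the reduced data, and map the coefficients back to $\bbR^p$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 200, 2                           # note n < p
H = np.linalg.qr(rng.normal(size=(p, d)))[0]   # assumed known basis of the reduction subspace

# simulate data in which y depends on X only through H^T X
X = rng.normal(size=(n, p))
beta = H @ rng.normal(size=d)
y = X @ beta + 0.1 * rng.normal(size=n)

# F is the p x n matrix with columns X_i - Xbar; sample moments as in the text
F = (X - X.mean(0)).T
sigma_hat = F @ y / n                          # hat sigma = n^{-1} F Y
Sigma_hat = F @ F.T / n                        # hat Sigma = n^{-1} F F^T (singular here)

# tilde beta_H = H (H^T hat Sigma H)^{-1} H^T hat sigma
A = H.T @ Sigma_hat @ H                        # d x d, invertible as long as n >> d
beta_tilde_H = H @ np.linalg.solve(A, H.T @ sigma_hat)
```

Note that $\hat\Sigma$ itself is singular here ($n<p$), yet the estimator is well defined because only the $d\times d$ matrix $H^T\hat\Sigma H$ is inverted.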

The unique and essential ingredient supplied by PLS is an algorithm for estimating $\cH$.

SIMPLS algorithm

The aim is to estimate $\cH$ in univariate regressions.

Set $w_0=0$ and $W_0=w_0$. For $k=0,\ldots,d-1$, set

\[\begin{align*} \cS_k &= \span(\Sigma W_k)\\ w_{k+1}&=Q_{\cS_k}\sigma/(\sigma^TQ_{\cS_k}\sigma)^{1/2}\\ W_{k+1}&=(w_0,\ldots,w_k,w_{k+1})\,. \end{align*}\]

At termination, $\span(W_d)$ is a dimension reduction subspace $\cH$. Since $d$ is assumed to be known and effectively fixed, SIMPLS depends on only two population quantities, $\sigma$ and $\Sigma$, that must be estimated. The sample version of SIMPLS is constructed by replacing $\sigma$ and $\Sigma$ with their sample counterparts and terminating after $d$ steps, even if $\hat\Sigma$ is singular. In particular, SIMPLS does not make use of $\hat\Sigma^{-1}$ and so doesn't require $\hat\Sigma$ to be nonsingular, but it does require $d\le \min(p,n-1)$.

If $d=p$, then $\span(W_p)=\bbR^p$ and PLS reduces to the ordinary least squares estimator.
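A minimal numpy sketch of the SIMPLS recursion above, written for the population quantities $\Sigma$ and $\sigma$; the sample version simply plugs in $\hat\Sigma$ and $\hat\sigma$ (the function and variable names are my own, not from the paper).

```python
import numpy as np

def simpls_weights(Sigma, sigma, d):
    """Return W = (w_1, ..., w_d) from the SIMPLS recursion; span(W) is the
    dimension reduction subspace. Plug in hat-Sigma and hat-sigma for the
    sample version (hat-Sigma may be singular). d should not exceed the true
    envelope dimension, otherwise the normalization degenerates."""
    p = Sigma.shape[0]
    W = np.zeros((p, 0))
    for _ in range(d):
        M = Sigma @ W                                          # spans S_k = span(Sigma W_k)
        if M.size:
            Q = np.eye(p) - M @ np.linalg.pinv(M.T @ M) @ M.T  # Q_{S_k}
        else:
            Q = np.eye(p)                                      # S_0 = {0}, so Q_{S_0} = I
        w = Q @ sigma
        W = np.column_stack([W, w / np.sqrt(sigma @ Q @ sigma)])
    return W
```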

Let $G=(\sigma,\Sigma\sigma,\ldots,\Sigma^{d-1}\sigma)$ and $\hat G = (\hat\sigma,\hat\Sigma\hat\sigma,\ldots,\hat\Sigma^{d-1}\hat\sigma)$ denote population and sample Krylov matrices.

Krylov subspace: In linear algebra, the order-$r$ Krylov subspace generated by an $n\times n$ matrix $A$ and a vector $b$ of dimension $n$ is the linear subspace spanned by the images of $b$ under the first $r$ powers of $A$:

\[\cK_{r,b}=\span\{b,Ab,A^2b,\ldots,A^{r-1}b\}\,.\]

Helland showed that $\span(G)=\span(W_d)$, giving a closed-form expression for a basis of the population PLS subspace, and that the sample version of the SIMPLS algorithm gives $\span(\hat G)$.
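Helland's result means that in practice one can bypass the recursion and work directly with the Krylov matrix; a small sketch of the construction (illustrative names):

```python
import numpy as np

def krylov_matrix(Sigma, sigma, d):
    """Return (sigma, Sigma sigma, ..., Sigma^{d-1} sigma). By Helland's
    result its span equals span(W_d) from the SIMPLS recursion."""
    cols = [sigma]
    for _ in range(d - 1):
        cols.append(Sigma @ cols[-1])
    return np.column_stack(cols)
```

Numerically, one can check that the orthogonal projections onto $\span(\hat G)$ and onto the span of the SIMPLS weights from the previous sketch agree up to rounding error.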

View PLS as an envelope method

  • reducing subspace: A subspace $\cR\subseteq \bbR^p$ is a reducing subspace of $\Sigma$ if $\cR$ decomposes $\Sigma=P_\cR\Sigma P_\cR+Q_\cR\Sigma Q_\cR$ and then we say that $\cR$ reduces $\Sigma$.
  • the $\Sigma$-envelope of $\cS$: The intersection of all reducing subspaces of $\Sigma$ that contain a specified subspace $\cS\subseteq \bbR^p$ is called the $\Sigma$-envelope of $\cS$ and denoted as $\cE_\Sigma(\cS)$.

Let $P_k$ denote the projection onto the $k$-th eigenspace of $\Sigma,k=1,\ldots,q\le p$. Then the $\Sigma$-envelope of $\cS$ can be constructed by projecting onto the eigenspaces of $\Sigma$: $\cE_{\Sigma}(\cS) = \sum_{k=1}^qP_k\cS$. Cook et al. showed that the population SIMPLS algorithm produces $\cE_\Sigma(\cB)$, the $\Sigma$-envelope of $\cB:=\span(\beta)$, so $\cH=\span(W_d)=\span(G)=\cE_\Sigma(\cB)$.
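A minimal sketch of this eigenspace-projection construction (my own illustrative implementation; eigenspaces are identified by grouping numerically equal eigenvalues, which is an assumption of the sketch rather than anything in the paper):

```python
import numpy as np

def sigma_envelope(Sigma, B, tol=1e-8):
    """Basis of the Sigma-envelope of span(B): project span(B) onto each
    eigenspace of Sigma and take the span of the nonzero projections.
    Assumes B is not the zero matrix."""
    p = Sigma.shape[0]
    B = np.asarray(B).reshape(p, -1)
    vals, vecs = np.linalg.eigh(Sigma)
    pieces, i = [], 0
    while i < p:
        j = i + 1
        while j < p and abs(vals[j] - vals[i]) < tol * max(1.0, abs(vals[i])):
            j += 1                              # group equal eigenvalues -> one eigenspace
        Vk = vecs[:, i:j]                       # orthonormal basis of the k-th eigenspace
        Pk_B = Vk @ (Vk.T @ B)                  # P_k applied to the columns of B
        if np.linalg.norm(Pk_B) > tol:
            pieces.append(Pk_B)
        i = j
    U, s, _ = np.linalg.svd(np.column_stack(pieces), full_matrices=False)
    return U[:, s > tol * s[0]]                 # orthonormal basis of the envelope
```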

A non-square matrix $A$ is semi-orthogonal if either $A^TA=I$ or $AA^T=I$.

Denote

  • $H\in\bbR^{p\times d}$: any semi-orthogonal basis matrix for $\cE_\Sigma(\cB)$
  • $(H,H_0)\in\bbR^{p\times p}$: an orthogonal matrix

The following envelope model for PLS:

\[\begin{equation} \begin{split} y &= \mu + \beta_{y\mid H^TX}^TH^T(X-\E(X))+\epsilon\\ \Sigma &= H\Sigma_H H^T + H_0\Sigma_{H_0}H_0^T\,, \end{split}\label{2.3} \end{equation}\]

where $\Sigma_H =\var(H^TX)$ and $\Sigma_{H_0}=\var(H^T_0X)$. By re-parameterization,

\[\beta = P_{\cH(\Sigma)}\beta = H\beta_{y\mid H^TX} = H(H^T\Sigma H)^{-1}H^T\sigma = G(G^T\Sigma G)^{-1}G^T\sigma\,.\]
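A quick numerical check of these equalities under a toy envelope structure (the construction of $\Sigma$, $H$ and $\sigma$ below is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 8, 2
# build Sigma = H Sigma_H H^T + H_0 Sigma_H0 H_0^T and a sigma lying in span(H)
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
H, H0 = Q[:, :d], Q[:, d:]
Sigma = H @ np.diag([4.0, 2.0]) @ H.T + H0 @ np.eye(p - d) @ H0.T
sigma = H @ rng.normal(size=d)

beta = np.linalg.solve(Sigma, sigma)                 # beta = Sigma^{-1} sigma
G = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma for k in range(d)])

beta_H = H @ np.linalg.solve(H.T @ Sigma @ H, H.T @ sigma)
beta_G = G @ np.linalg.solve(G.T @ Sigma @ G, G.T @ sigma)
print(np.allclose(beta, beta_H), np.allclose(beta, beta_G))   # True True
```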

Beginning with model $\eqref{2.3}$, Cook et al. developed likelihood-based estimators whose performance dominates that of SIMPLS in the traditional fixed-$p$ context.

From $\eqref{2.3}$, $y\ind X\mid H^TX$ and $H^TX\ind H_0^TX$, which together imply that

\[(y,H^TX)\ind H_0^TX\,.\]

Model $\eqref{2.3}$ and the condition $H^TX\ind H_0^TX$ are what set the PLS framework apart from that of sufficient dimension reduction.

Consequence of this structure:

  • the distribution of $y$ can respond to changes in $H^TX$, but changes in $H_0^TX$ affect neither the distribution of $y$ nor the distribution of $H^TX$; hence we refer to $H_0^TX$ as the noise in $X$.
  • the predictive success of PLS depends crucially on the relative sizes of $\Sigma_{H_0}$, the variability of the noise in $X$, and $\Sigma_H$, the variability in the part of $X$ that affects $y$.

Objectives

Consider

\[\begin{align*} \beta &= G(G^T\Sigma G)^{-1}G^T\sigma\\ \hat\beta & = \hat G(\hat G^T\hat\Sigma\hat G)^{-1}\hat G^T\hat\sigma\,, \end{align*}\]

study the predictive performance of $\hat\beta$ as $n$ and $p$ grow in various alignments. The difference between the PLS predicted value $\hat y_N$ at a new observation $X_N$ and the actual response $y_N$ satisfies

\[\hat y_N-y_N = O_p\{(\hat\beta-\beta)^T(X_N-\E(X))\} + \epsilon_N\qquad \text{as } n,p\rightarrow\infty\,.\]

The estimative performance of $\hat\beta$ is measured by

\[D_N:=(\hat\beta-\beta)^T\omega_N\,,\]

where $\omega_N=X_N-\E(X)\sim N(0,\Sigma)$. The goal is to determine asymptotic properties of $D_N$ as $n,p\rightarrow\infty$. Since

\[\var(D_N\mid\hat\beta) = (\hat\beta-\beta)^T\Sigma(\hat\beta-\beta)\,,\]

results for $D_N$ also tell us about the asymptotic behavior of $\hat\beta$ in the $\Sigma$ inner product.
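An end-to-end simulation sketch of the sample PLS estimator in its Krylov form and of the discrepancy $D_N$; the isotropic single-component design below is just a convenient toy choice, not a representation of any particular regime from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 100, 500, 1                      # n < p on purpose

beta = np.full(p, 1.0 / np.sqrt(p))        # toy single-component signal
X = rng.normal(size=(n, p))                # Sigma = I_p
y = X @ beta + rng.normal(size=n)          # tau = 1

F = (X - X.mean(0)).T                      # p x n matrix of centered predictors
sigma_hat = F @ y / n                      # hat sigma
Sigma_hat = F @ F.T / n                    # hat Sigma (singular since n < p)

# hat G = (hat sigma, hat Sigma hat sigma, ..., hat Sigma^{d-1} hat sigma)
G_hat = sigma_hat.reshape(-1, 1)
for _ in range(d - 1):
    G_hat = np.column_stack([G_hat, Sigma_hat @ G_hat[:, -1]])

# hat beta = hat G (hat G^T hat Sigma hat G)^{-1} hat G^T hat sigma
beta_hat = G_hat @ np.linalg.solve(G_hat.T @ Sigma_hat @ G_hat, G_hat.T @ sigma_hat)

# D_N = (hat beta - beta)^T omega_N at a new independent point
omega_N = rng.normal(size=p)               # X_N - E(X) ~ N(0, I_p)
D_N = (beta_hat - beta) @ omega_N
print(D_N)
```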

Overarching considerations

Dimension $d$ of $\cE_\Sigma(\cB)$

Assume that $d$ is known and constant for $p\ge d$, although it may increase for a time with $p$.

Signal and noise in $X$

Under the envelope construction, $\cB\subseteq \cE_\Sigma(\cB)$, and for any nonnegative integer $k$,

\[\Sigma^k = H\Sigma_H^kH^T + H_0\Sigma_{H_0}^kH_0^T\,.\]

Our asymptotic results depend fundamentally on the sizes of $\Sigma_H$ and $\Sigma_{H_0}$. Define $\eta(p):\bbR\mapsto\bbR$ and $\kappa(p):\bbR\mapsto \bbR$ as

\[\begin{align*} \tr(\Sigma_H)\asymp \eta(p)&\ge 1\\ \tr(\Sigma_{H_0})\asymp \kappa(p) \end{align*}\]

where we imposed the condition $\eta(p)\ge 1$ without loss of generality.

intuition about $\eta(p)$

Let $\lambda_i$ denote the $i$-th eigenvalue of $\Sigma_H, i=1,\ldots,d$, and assume w.l.o.g. that the columns of $H=(h_1,\ldots,h_d)$ are orthogonal eigenvectors of $\Sigma$ (this is possible because $\cE_\Sigma(\cB)$ reduces $\Sigma$ and therefore has a basis consisting of eigenvectors of $\Sigma$). Then

\[\begin{align*} \beta^T\Sigma\beta &= \sigma^T\Sigma^{-1}\sigma\\ &=\sigma^TH\Sigma_H^{-1}H^T\sigma\\ &=\Vert \sigma\Vert^2\Big(\frac{\sigma^TH\Sigma_H^{-1}H^T\sigma}{\sigma^TP_\cH\sigma}\Big)\\ &=\sum_{i=1}^dw_i(\Vert\sigma\Vert^2/\lambda_i) \end{align*}\]

where the weights $w_i=\sigma^TP_{h_i}\sigma/\sigma^TP_\cH\sigma$, $P_{h_i}$ denotes the projection onto $\span(h_i)$ and $\sum_{i=1}^dw_i=1$. Consequently, if the $w_i$ are bounded away from 0 and if many predictors are correlated with $y$ so that $\Vert \sigma\Vert^2\rightarrow\infty$, then the eigenvalues of $\Sigma_H$ must diverge to ensure that $\beta^T\Sigma\beta$ remains bounded. We could in this case take $\eta(p)=\Vert \sigma\Vert^2$.
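A quick numerical check of the weighted-sum identity above, with $H$ taken to be eigenvectors of $\Sigma$ as assumed (the specific eigenvalues and dimensions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
p, d = 6, 2
evecs, _ = np.linalg.qr(rng.normal(size=(p, p)))       # orthonormal eigenvectors of Sigma
lam = np.array([5.0, 3.0, 1.0, 1.0, 1.0, 1.0])         # eigenvalues; the first d belong to Sigma_H
Sigma = evecs @ np.diag(lam) @ evecs.T
H = evecs[:, :d]

sigma = H @ rng.normal(size=d)                         # sigma lies in span(H)
beta = np.linalg.solve(Sigma, sigma)

lhs = beta @ Sigma @ beta                              # beta^T Sigma beta
w = (H.T @ sigma) ** 2 / (sigma @ sigma)               # w_i = sigma^T P_{h_i} sigma / sigma^T P_H sigma
rhs = np.sum(w * (sigma @ sigma) / lam[:d])            # sum_i w_i ||sigma||^2 / lambda_i
print(np.allclose(lhs, rhs))                           # True
```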

Suppose that the first $k$ eigenvalues $\lambda_i,i=1,\ldots,k$, diverge with $p$, that $\lambda_i\asymp \lambda_j,i,j=1,\ldots,k$, and that the remaining $d-k$ eigenvalues are of lower order, $\lambda_j=o(\lambda_i),i=1,\ldots,k, j=k+1,\ldots,d$. Then if $\Vert\sigma\Vert^2\asymp \lambda_i,i=1,\ldots,k$, we must have $w_i\rightarrow 0$ for $i=k+1,\ldots,d$ for $\beta^T\Sigma\beta$ to remain bounded.

It is possible also that the eigenvalues $\lambda_i$ are bounded. This happens in sparse regressions when only $d$ predictors are relevant.

  1. Since $\kappa$ is the trace of a $(p-d)\times (p-d)$ positive definite matrix, it would normally be at least of order $p$, but it might be of a larger order.
  2. Since $\eta$ is the trace of a $d\times d$ matrix, it will in practice have order at most $p$, and it can achieve that order in abundant regressions where $\Vert\sigma\Vert^2\asymp p$.

We can contrive cases where $p=o(\eta)$, but they seem impractical. For these reasons, the authors limit consideration to regressions in which $\eta=O(\kappa)$.

The measures $\kappa$ and $\eta$ are frequently joined naturally in the asymptotic expansions into the combined measure

\[\phi(n,p)=\frac{\kappa(p)}{n\eta(p)}\,.\]

A good scenario for prediction occurs when $\phi(n,p)\rightarrow 0$ as $n,p\rightarrow \infty$. This implies a synergy between the signal $\eta$ and the sample size $n$, with the product $n\eta$ being required to dominate the variation of the noise in $X$ as measured by $\kappa$.

Coefficients $\beta_{y\mid H^TX}$

The coefficients for the regression of $y$ on the reduced predictors $H^TX$ can be represented as $\beta_{y\mid H^TX}=\Sigma_H^{-1}\sigma_H$, where $\sigma_H=H^T\sigma\in\bbR^{d\times 1}$.

Population predictions based on the reduced predictors involve the product $\beta_{y\mid H^TX}^TH^TX$. If $\var(H^TX)=\Sigma_H$ diverges along certain directions, then the corresponding parts of $\beta_{y\mid H^TX}$ must converge to 0 to balance the increase in $H^TX$; otherwise the form $\beta^T_{y\mid H^TX}H^TX$ will not make sense asymptotically.

Error variance $\tau^2$

Since

\[\var(y)=\beta^T\Sigma\beta+\tau^2\]

is constant and $\beta^T\Sigma\beta$ is a monotonically increasing function of $p$, $\tau^2$ must correspondingly decrease with $p$.

Asymptotic dependence

While $\span(G)=\cE_\Sigma(\cB)$, $G$, unlike the semi-orthogonal basis matrix $H$ of $\cE_\Sigma(\cB)$, is not semi-orthogonal, and thus we need to keep track of any asymptotic linear dependencies among the reduced variables $G^TX\in\bbR^d$. Let

\[C = \diag^{-1/2}(G^T\Sigma G)G^T\Sigma G\diag^{-1/2}(G^T\Sigma G)\in\bbR^{d\times d}\]

denote the correlation matrix for $G^TX$, and define the function $\rho(p)$ so that as $p\rightarrow\infty$

\[\tr(C^{-1})\asymp\rho(p)\,.\]

Since

\[\tr(C^{-1}) = \sum_{i=1}^d(1-R_i^2)^{-1}\,,\]

where $R_i^2$ is the squared multiple correlation from the regression of the $i$-th element of $G^TX$ on the remaining elements, $\rho$ essentially describes the rate of increase of the sum of the variance inflation factors.
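A small sketch of how one might compute $\tr(C^{-1})$ for the Krylov-reduced predictors (illustrative function name; for $d>1$ the Krylov columns can be nearly collinear, which is exactly what this quantity picks up):

```python
import numpy as np

def vif_trace(Sigma, sigma, d):
    """tr(C^{-1}) for the correlation matrix C of G^T X, i.e. the sum of the
    variance inflation factors of the d Krylov-reduced predictors."""
    G = np.column_stack([np.linalg.matrix_power(Sigma, k) @ sigma for k in range(d)])
    V = G.T @ Sigma @ G                                  # var(G^T X)
    Dinv = np.diag(1.0 / np.sqrt(np.diag(V)))
    C = Dinv @ V @ Dinv                                  # correlation matrix of G^T X
    return np.trace(np.linalg.inv(C))
```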

In high-dimensional regressions, the eigenvalues of $\Sigma$ are often assumed to be bounded away from 0 and $\infty$ as $p\rightarrow\infty$.

The paper gives a proposition with necessary and sufficient conditions for $\tr(C^{-1})$ to be bounded.

Compound symmetry

Consider a regression in which $\Sigma>0$ has compound symmetry, with diagonal elements all equal to 1 and off-diagonal elements all equal to $\psi\in(0,1)$,

\[\Sigma = (1-\psi+p\psi)P_1 + (1-\psi)Q_1\,,\]

where $P_1$ is the projection onto the $p\times 1$ vector of ones $1_p$ and $Q_1=I_p-P_1$.
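A one-line numerical check of this spectral decomposition (arbitrary small $p$ and $\psi$):

```python
import numpy as np

p, psi = 5, 0.3
Sigma = (1 - psi) * np.eye(p) + psi * np.ones((p, p))      # compound symmetry
one = np.ones((p, 1))
P1 = one @ one.T / p                                       # projection onto span(1_p)
Q1 = np.eye(p) - P1
print(np.allclose(Sigma, (1 - psi + p * psi) * P1 + (1 - psi) * Q1))   # True
```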

Universal conditions

  • $(y,X)$ follows a nonsingular multivariate normal distribution
  • $\phi\rightarrow 0$ and $\rho/\sqrt n\rightarrow 0$ as $n,p\rightarrow \infty$.
  • $\eta = O(\kappa)$ as $p\rightarrow\infty$.
  • The dimension $d$ of the envelope is known and constant for all finite $p$.
  • $\Sigma>0$ for all finite $p$.

Asymptotic results

Order of $D_N$

Two theorems.

Isotropic predictor variation

The rates are calculated, and they suggest that PLS predictions could be useful in high-dimensional regressions.

Anisotropic predictor variation

Again an application of the theorems.

Discussion

Results on the convergence of $\hat\beta$ are given, along with the rationale for some of the restrictions.

Multivariate $Y$

There are numerous PLS algorithms for multivariate $Y$, and they can all produce different results. The two most common algorithms, NIPALS and SIMPLS, are known to produce different results when $r > 1$ but give the same result when $r=1$.

Importance of normality

Assuming normality allowed the authors to get relatively sharp bounds, which is useful for a first look at PLS asymptotics.

  • if nearly all predictors contribute information about the response, so $\eta\asymp p$, then we may have $D_N=O_p(n^{-1/2})$ without regard to the relationship between $n$ and $p$.
  • if the regression is viewed as likely sparse, so $\eta\asymp 1$, then we may have $D_N=O_p((p/n)^{1/2})$ and we now need $n$ to be large relative to $p$.

In short, it is possible in some scenarios for PLS to have $\sqrt n$ or near-$\sqrt n$ convergence rates as $n$ and $p$ diverge.


Published in categories Note