# PLS in High-Dimensional Regression

##### Posted on

This note is based on Cook, R. D., & Forzani, L. (2019). Partial least squares prediction in high-dimensional regression. The Annals of Statistics, 47(2), 884–908.

## Abstract

### goal

The asymptotic behavior of predictions from PLS regression as the sample size and number of predictors diverge in various alignments.

### results

- there is a range of regression scenarios where PLS predictions have the usual $\sqrt n$ convergence rate, even when the sample size is substantially smaller than the number of predictors.
- an even wider range where the rate is slower but may still produce practically useful results.
- PLS predictions achieve their best asymptotic behavior in abundant regressions where many predictors contribute information about the response.
- Their asymptotic behavior tends to be undesirable in sparse regressions where few predictors contribute information about the response.

## Introduction

### Literatures

- Helland: the first to define a PLS regression model, and a first attempt at MLE
- Frank and Friedman: an informative discussion of PLS regression from various statistical views
- Garthiwate: a statistical interpretation of PLS algorithm
- Naik and Tsai: PLS regression provides a consistent estimator of the central subspace when the distribution of the response given the predictors follows a single-index model
- Delaigle and Hall: extended it to functional data
- Cook, Helland and Su: establish a population connection between PLS regression and envelopes in the context of multivariate linear regression, provides the first firm PLS model and showed that envelope estimation leads to $\sqrt n$ consistent estimators in traditional fixed $p$ contexts.

### Applications

- Chemometrics and Statistics communities
- the analysis of high-dimensional genomic data
- microarray-based classification
- analysis of data from PET and fMRI studies
- spatiotemporal data
- image analysis

### Statistical properties

the algorithmic nature of PLS evidently made it difficult to study using traditional statistical measures, with the consequence that PLS was long regarded as a technique that is useful, but whose core statistical properties are elusive.

- Chun and Keles: within a certain modeling framework, the PLS estimator is inconsistent unless $p/n\rightarrow 0$
- Cook and Forzani: study single-component PLS, and found that in some reasonable settings, PLS can converge at $\sqrt n$ rate as $n,p\rightarrow \infty$, regardless of the alignment between $n$ and $p$.

In this paper,

- follow the general setup of Cook and Forzani, and use traditional $(n,p)$-asymptotic arguments to provide insights into PLS predictions in multiple-component regressions.
- give bounds on the rates of convergence for PLS predictions as $n,p\rightarrow\infty$ and in doing so the authors conjecture about the value of PLS in various regression scenarios.

## PLS review

Consider typical linear regression model with univariate response $y$ and random predictor vector $X\in\bbR^p$,

\[y=\mu + \beta^T(X-\E X)+\epsilon\,,\]where $\epsilon\sim (0,\tau^2)$. Assume $(y,X)$ follows a nonsingular multivariate normal distribution and the data $(y_i,X_i),i=1,\ldots,n$ arise as independent copies of $(y,X)$. Let $Y=(y_1,\ldots,y_n)^T$ and let $F$ denote the $p\times n$ matrix with columns $(X_i-\bar X),i=1,\ldots,n$. Then

\[Y=\alpha 1_n+F^T\beta + \varepsilon\,,\]where $\alpha=\E(y)$ and $\varepsilon=(\epsilon_i)$.

Notations:

- $\Sigma=\var(X)>0$
- $\sigma=\cov(X,y)$.
- $W_q(\Omega)$: the Wishart distribution with $q$ degrees of freedom and scale matrix $\Omega$.
- $P_{A(\Delta)}$: the projection in the $\Delta>0$ inner product onto span($A$) if $A$ is a matrix or onto $A$ itself if it is a subspace.
- $P_A:=P_{A(I)}$: the projection in the usual inner product and $Q_A=I-P_A$.
- $\hat\sigma = n^{-1}FY$: the usual moment estimator of $\sigma$
- $\hat\Sigma=n^{-1}FF^T$: the usual moment estimator of $\Sigma$
- $W=FF^T\sim W_{n-1}(\Sigma)$, then $\hat\Sigma=W/n, \hat\sigma=n^{-1}(W\beta+F\varepsilon)$
- $B=\hat\Sigma^{-1}\hat\sigma$: OLS estimator of $\beta$.

Identify a dimension reduction subspace $\cH\subseteq \bbR^p$ so that $y\ind X\mid P_\cH X$ and $d:=\text{dim}(\cH)< p$. Assume a basis matrix $H\in \bbR^{p\times d}$ of $\cH$ is known and that $\hat\Sigma>0$. Following the reduction $X\mapsto H^TX$, apply OLS to estimate the coefficient vector $\beta_{y\mid H^TX}$,

\[\tilde\beta_{y\mid H^TX} = (H^T\hat\Sigma H)^{-1}H^T\hat\sigma\,.\]The known-$H$ estimator $\tilde \beta_H$ of $\beta$ is then

\[\tilde \beta_H = H\tilde \beta_{y\mid H^TX} = P_{\cH(\hat\Sigma)}B\]??: project back ?

Here

\[P_{\cH(\hat\Sigma)}= H(H^T\hat\Sigma H)^{-1} H^T\]

which implies that

- $\tilde \beta_H$ as a projection of $B$ onto $\cH$
- $\tilde \beta_H$ depends on $H$ only via $\cH$
- $\tilde \beta_H$ requires $H^T\hat\Sigma H>0$, but doesn’t actually require $\hat\Sigma>0$. This is essentially how PLS handles $n < p $ regressions: by reducing the predictors to $H^TX$ while requiring $n»d$.

THe unique and essential ingredient supplied by PLS is an algorithm for estimating $\cH$.

### SIMPLS algorithm

Aim to estimate $\cH$ is univariate regressions.

Set $w_0=0$ and $W_0=w_0$. For $k=0,\ldots,d-1$, set

\[\begin{align*} \cS_k &= \span(\Sigma W_k)\\ w_{k+1}&=Q_{\cS_k}\sigma/(\sigma^TQ_{\cS_k}\sigma)^{1/2}\\ W_{k+1}&=(w_0,\ldots,w_k,w_{k+1})\,. \end{align*}\]At termination, $\span(W_d)$ is a dimension reduction subspace $\cH$. Since $d$ is assumed to be known and effectively fixed, SIMPIS depends on only two population quantities, $\sigma$ and $\Sigma$, that must be estimated. The sample version of SIMPLS is constructed by replacing $\sigma$ and $\Sigma$ by their sample counterparts and terminating after $d$ steps, even if $\hat\Sigma$ is singular. In particular, SIMPLS does not make use of $\hat\Sigma^{-1}$ and so doesn’t require $\hat\Sigma$ to be nonsingular, but it does require $d\le \min(p,n-1)$.

If $d=p$, then $\span(W_p)=\bbR^p$ and PLS reduces to the ordinary least square estimator.

Let $G=(\sigma,\Sigma\sigma,\ldots,\Sigma^{d-1}\sigma)$ and $\hat G = (\hat\sigma,\hat\Sigma\hat\sigma,\ldots,\hat\Sigma^{d-1}\hat\sigma)$ denote population and sample Krylov matrices.

Krylov subspace: In linear algebra, the order-$r$ Krylov subspace generated by an $n\times n$ matrix $A$ and a vector $b$ of dimension $n$ is the linear subspace spanned by the images of $b$ under the first $r$ power of $A$:

\[\cK_{r,b}=\span\{b,Ab,A^2b,\ldots,A^{r-1}b\}\,.\]

Helland showed that $\span(G)=\span(W_d)$, giving a closed-form expression for a basis of the population PLS subspace, and that the sample version of the SIMPLS algorithm gives $\span(\hat G)$.

### View PLS as an envelope method

**reducing subspace**: A subspace $\cR\subseteq \bbR^p$ is a reducing subspace of $\Sigma$ if $\cR$ decomposes $\Sigma=P_\cR\Sigma P_\cR+Q_\cR\Sigma Q_\cR$ and then we say that $\cR$ reduces $\Sigma$.**the $\Sigma$-envelope of $\cS$**: The intersection of all reducing subspaces of $\Sigma$ that contain a specified subspace $\cS\subseteq \bbR^p$ is called the $\Sigma$-envelope of $\cS$ and denoted as $\cE_\Sigma(\cS)$.

Let $P_k$ denote the projection onto the $k$-th eigenspace of $\Sigma,k=1,\ldots,q\le p$. Then the $\Sigma$-envelope of $\cS$ can be constructed by projecting onto the eigenspaces of $\Sigma$: $\cE_{\Sigma}(\cS) = \sum_{i=1}^qP_k\cS$. Cook et al. showed that the population SIMPLS algorithm produces $\cE_\Sigma(\cB)$, the $\Sigma$-envelope of $\cB:=\span(\beta)$, so $\cH=\span(W_d)=\span(G)=\cE_\Sigma(\cB)$.

A non-square matrix $A$ is semi-orthogonal if either $A^TA=I$ or $AA^T=I$.

Denote

- $H\in\bbR^{p\times d}$: any semi-orthogonal basis matrix for $\cE_\Sigma(\cB)$
- $(H,H_0)\in\bbR^{p\times p}$: an orthogonal matrix

The following envelope model for PLS:

\[\begin{equation} \begin{split} y &= \mu + \beta_{y\mid H^TX}^TH^T(X-\E(X))+\epsilon\\ \Sigma &= H\Sigma_H H^T + H_0\Sigma_{H_0}H_0^T\,, \end{split}\label{2.3} \end{equation}\]where $\Sigma_H =\var(H^TX)$ and $\Sigma_{H_0}=\var(H^T_0X)$. By re-parameterization,

\[\beta = P_{\cH(\Sigma)}\beta = H\beta_{y\mid H^TX} = H(H^T\Sigma H)^{-1}H^T\sigma = G(G^T\Sigma G)^{-1}G^T\sigma\,,\]Beginning with model $\eqref{2.3}$, Cook et al. developed likelihood-based estimators whose performance dominates that of the SIMPLS in the traditional fixed $p$ context.

From $\eqref{2.3}$, $y\ind X\mid H^TX$ and $H^TX\ind H_0^TX$, which together imply that

\[(y,H^TX)\ind H_0^TX\,.\]Model $\eqref{2.3}$ and the condition $H^TX\ind H_0^TX$ are what sets the PLS framework apart from that of sufficient dimension reduction.

Consequence of this structure:

- the distribution of $y$ can respond to changes in $H^TX$, but changes in $H_0^TX$ affect neither the distribution of $y$ nor the distribution of $H^TX$, and hence refer to $H_0^TX$ as the noise in $X$.
- the predictive success of PLS depends crucially on the relative size of $\Sigma_{H_0}$, the variability of the noise in $X$ and $\Sigma_H$ the variability in the part of $X$ that affects $y$.

## Objectives

Consider

\[\begin{align*} \beta &= G(G^T\Sigma G)^{-1}G^T\sigma\\ \hat\beta & = \hat G(\hat G^T\hat\Sigma\hat G)^{-1}\hat G^T\hat\sigma\,, \end{align*}\]study the predictive performance of $\hat\beta$ as $n$ and $p$ grow in various alignments. The difference between PLS predicted value of $y_N$ at $x_N$, $\hat y_N$,

\[\hat y_N-y_N = O_p\{(\hat\beta-\beta)^T(X_N-\E(X))\} + \epsilon_N\qquad \text{as } n,p\rightarrow\infty\,.\]The estimative performance of $\hat\beta$ is measured by

\[D_N:=(\hat\beta-\beta)^T\omega_N\,,\]where $\omega_N=X_N-\E(X)\sim N(0,\Sigma)$. The goal is to determine asymptotic properties of $D_N$ as $n,p\rightarrow\infty$. Since

\[\var(D_N\mid\hat\beta) = (\hat\beta-\beta)^T\Sigma(\hat\beta-\beta)\,,\]results for $D_N$ also tell us about the asymptotic behavior of $\hat\beta$ in the $\Sigma$ inner product.

## Overarching considerations

### Dimension $d$ of $\cE_\Sigma(\cB)$

assume $d$ is known and constant for $p\ge d$, although it may increase for a time with $p$.

### Signal and noise in $X$

Under the envelope construction, $\cB\subseteq \cE_\Sigma(\cB)$, and for any nonnegative integer $k$,

\[\Sigma^k = H\Sigma_H^kH^T + H_0\Sigma_{H_0}^kH_0^T\,.\]Our asymptotic results depend fundamentally on the sizes of $\Sigma_H$ and $\Sigma_{H_0}$. Define $\eta(p):\bbR\mapsto\bbR$ and $\kappa(p):\bbR\mapsto \bbR$ as

\[\begin{align*} \tr(\Sigma_H)\asymp \eta(p)&\ge 1\\ \tr(\Sigma_{H_0})\asymp \kappa(p) \end{align*}\]where we imposed the condition $\eta(p)\ge 1$ without loss of generality.

#### intuition about $\eta(p)$

let $\lambda_i$ denote the $i$-th eigenvalue of $\Sigma_H, i=1,\ldots,d$, and assume w.l.o.g. that the columns of $H=(h_1,\ldots,h_d)$ are orthogonal eigenvectors of $\Sigma$. (??) Then

\[\begin{align*} \beta^T\Sigma\beta &= \sigma^T\Sigma^{-1}\sigma\\ &=\sigma^TH\Sigma_H^{-1}H^T\sigma\\ &=\Vert \sigma\Vert^2\Big(\frac{\sigma^TH\Sigma_H^{-1}H^T\sigma}{\sigma^TP_\cH\sigma}\Big)\\ &=\sum_{i=1}^dw_i(\Vert\sigma\Vert^2/\lambda_i) \end{align*}\]where the weights $w_i=\sigma^TP_{h_i}\sigma/\sigma^TP_\cH\sigma$, $P_{h_i}$ denotes the projection onto $\span(h_i)$ and $\sum_{i=1}^dw_i=1$. Consequently, if the $w_i$ are bounded away from 0 and if many predictors are correlated with $y$ so that $\Vert \sigma\Vert^2\rightarrow\infty$, then the eigenvalue of $\Sigma_H$ must diverge to ensure that $\beta^T\Sigma\beta$ remains bounded. We could in this case take $\eta(p)=\Vert \sigma\Vert^2$.

Suppose that the first $k$ eigenvalues $\lambda_i,i=1,\ldots,k$, diverge with $p$, that $\lambda_i\asymp \lambda_j,i,j=1,\ldots,k$, and that the remaining $d-k$ eigenvalues are a lower order, $\lambda_j=o(\lambda_i),i=1,\ldots,k, j=k+1,\ldots,d$. Then if $\Vert\sigma\Vert^2\asymp \lambda_i,i=1,\ldots,k$, we must have $w_i\rightarrow 0$ for $i=k+1,\ldots,d$ for $\beta^T\Sigma\beta$ to remain bounded.

It is possible also that the eigenvalues $\lambda_i$ are bounded. This happens in sparse regressions when only $d$ predictors are relevant.

- The trace of a $(p-d)\times (p-d)$ positive definite matrix $\kappa$ would normally be at least order of $p$, but might have a larger order $\eta$.
- The trace of a $d\times d$ matrix $\eta$, will in practice have order at most $p$ and can achieve that order in abundant regressions where $\Vert\sigma\Vert^2\asymp p$.

We can contrive cases where $p=o(\eta)$, but they seem impractical. For these reasons, the author limit the consideration to regressions in which $\eta=O(\kappa)$.

The measures $\kappa$ and $\eta$ are frequently joined naturally in the asymptotic expansions into the combined measure

\[\phi(n,p)=\frac{\kappa(p)}{n\eta(p)}\,.\]A good scenario for prediction occurs when $\phi(n,p)\rightarrow 0$ as $n,p\rightarrow \infty$. This implies a synergy between the signal $\eta$ and the sample size $n$, with the product $n\eta$ being required to dominate the variation of the noise in $X$ as measured by $\kappa$.

### Coefficients $\beta_{y\mid H^TH}$

The coefficients for the regression of $y$ on the reduced predictors $H^TX$ can be represented as $\beta_{y\mid H^TX}=\Sigma_H^{-1}\sigma_H$, where $\sigma_H=H^T\sigma\in\bbR^{p\times 1}$.

Population predictions based on the reduced predictor involve the product $\beta_{y\mid H^TH}^TH^TX$. If $\var(H^TX)=\Sigma_H$ diverges along certain directions, then the corresponding parts of $\beta_{y\mid H^TH}$ converges to 0 to balance the increases in $H^TX$ or otherwise the form $\beta^T_{y\mid H^TX}H^TX$ will not make sense asymptotically.

### Error variance $\tau^2$

Since

\[\var(y)=\beta^T\Sigma\beta+\tau^2\]is constant and $\beta^T\Sigma\beta$ is a monotonically increasing function of $p$, $\tau^2$ must correspondingly decrease with $p$.

### Asymptotic dependence

While $\span(G)=\cE_\Sigma(\cB)$, $G$ is not semi-orthogonal but the basis matrix for $\cE_\Sigma(\cB)$, $H$, is semi-orthogonal, and thus we need to keep track of any asymptotic linear dependencies among the reduced variables $G^TX\in\bbR^d$. Let

\[C = \diag^{-1/2}(G^T\Sigma G)G^T\Sigma G\diag^{-1/2}(G^T\Sigma G)\in\bbR^{d\times d}\]denote the correlation matrix for $G^TX$, and define the function $\rho(p)$ so that as $p\rightarrow\infty$

\[\tr(C^{-1})\asymp\rho(p)\,.\]And since

\[\tr(C^{-1}) = \sum_{i=1}^d(1-R_i^2)^{-1}\,,\]so $\rho$ basically describes the rate of increase in the sum of variance inflation factors.

In high-dimensional regressions, the eigenvalues of $\Sigma$ are often assumed to be bounded away from 0 and $\infty$ as $p\rightarrow\infty$.

Proposition of necessary and sufficient conditions for $\tr(C^{-1})$ to be bounded.

### Compound symmetry

Consider a regression in which $\Sigma>0$ has compound symmetry with diagonal elements all 1 and constant off diagonal element $\psi\in(0,1)$,

\[\Sigma = (1-\psi+p\psi)P_1 + (1-\psi)Q_1\,,\]where $P_1$ is the projection onto the $p\times 1$ vector of ones $1_p$.

### Universal conditions

- $(y,X)$ follows a nonsingular multivariate normal distribution
- $\phi$ and $\rho/\sqrt n\rightarrow 0$ as $n,p\rightarrow \infty$.
- $\eta = O(\kappa)$ as $p\rightarrow\infty$.
- The dimension $d$ of the envelope is known and constant for all finite $p$.
- $\Sigma>0$ for all finite $p$.

## Asymptotic results

### Order of $D_N$

Two theorems.

### Isotropic predictor variation

Calculate the rates, and they suggest that PLS predictions could be useful in high-dimensional regressions.

### Anisotropic predictor variation

Again an application of the theorems.

## Discussion

Results on the convergence of $\hat\beta$ and describe the rationale for some of th restrictions.

### Multivariate $Y$

There are numerous PLS algorithms for multivariate $Y$ and they can all produce different results. The two most common algorithms NIPLS and SIMPLS are known to produce different results when $r > 1$ but give the same result when $r=1$.

### Importance of normality

assuming normality allowed us to get relatively sharp bounds, which is useful for a first look at PLS asymptotic.

- if nearly all predictors contribute information about the response, so $\eta\asymp p$, then we may have $D_N=O_p(n^{-1/2})$ without regard to the relationship between $n$ and $p$.
- if the regression is viewed as likely sparse, so $\eta\asymp 1$, then we may have $D_N=O_p((p/n)^{1/2})$ and we now need $n$ to be large relative to $p$.

In a word, it is possible in some scenarios for PLS to have $\sqrt n$ or near $\sqrt n$ convergence rates as $n$ and $p$ diverge.