Partial Least Squares for Functional Data
This post is based on Delaigle, A., & Hall, P. (2012). Methodology and theory for partial least squares applied to functional data. The Annals of Statistics, 40(1), 322–352.
Partial least squares (PLS) is an iterative procedure for estimating the slope in linear models. The paper provides a transparent account of the theoretical issues that underpin PLS methods in linear models for prediction from functional data, and shows that these issues motivate an alternative formulation of PLS (APLS) in that setting.
Functional linear models
General bases for inference in functional linear models
We observe a sample of independent data pairs,
\[\cX = \{(X_1,Y_1),\ldots, (X_n,Y_n)\}\]all distributed as $(X,Y)$, where
- $X$: a random function defined on the nondegenerate, compact interval $\cI$ and satisfying $\int_\cI E(X^2) < \infty$
- $Y$: a scalar random variable generated by the linear model $Y = a + \int_\cI bX + \varepsilon$ (see the simulation sketch after this list)
- $a$: a scalar parameter
- $\varepsilon$: a scalar random variable with finite mean square and satisfying $E(\varepsilon\mid X)=0$
- $b$: a function-valued parameter, a square-integrable function on $\cI$.
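To make this setup concrete, here is a minimal simulation sketch of the model on a grid over $\cI = [0,1]$. Everything in it (the grid, the Fourier construction of the curves, and the particular choices of $a$, $b$ and the noise level) is an illustrative assumption, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 101                       # sample size and grid size (assumed values)
t = np.linspace(0.0, 1.0, m)          # grid discretizing I = [0, 1]
dt = t[1] - t[0]                      # quadrature weight for Riemann sums

# Curves X_i built from a few random Fourier components (illustrative choice)
J = 5
fourier = np.sqrt(2.0) * np.array([np.cos(j * np.pi * t) for j in range(1, J + 1)])
coef = rng.normal(scale=1.0 / np.arange(1, J + 1), size=(n, J))
X = coef @ fourier                    # n x m matrix; row i discretizes X_i

# Slope function b, intercept a, and Y_i = a + \int_I b X_i + eps_i
b = np.sin(2.0 * np.pi * t)
a = 1.0
eps = rng.normal(scale=0.5, size=n)
Y = a + X @ b * dt + eps              # the integral is approximated by a Riemann sum
```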
Predicting $Y$ given $X$ amounts to estimating the function
\[g(x) = E(Y\mid X=x) = a+\int_\cI bx\,.\]Consider expansions of $X$ and $b$ in terms of an orthonormal basis $\psi_1, \psi_2,\ldots$ defined on $\cI$:
\[X = \sum_j\Big(\int_\cI X\psi_j\Big)\psi_j\,,\qquad b=\sum_j v_j\psi_j\,,\]where $v_j = \int_\cI b\psi_j$.
Note that $\int_\cI bX=\sum_j v_j\int_\cI X\psi_j$; taking expectations in the model gives $a = E(Y) - \int_\cI bE(X)$, and we define $\beta_1,\ldots,\beta_p$ to be the values of $v_1,\ldots,v_p$ that minimize
\[s_p(v_1,\ldots,v_p) = E\left\{ \int_\cI b(X-EX) - \sum_{j=1}^p v_j \int_\cI (X-EX)\psi_j \right\}^2\,.\]The functions
\[\begin{align*} b_p &= \sum_{j=1}^p\beta_j\psi_j\\ g_p(x) &= E(Y) + \int_\cI b_p(x-EX) = EY + \sum_{j=1}^p \beta_j\int_\cI (x-EX)\psi_j \end{align*}\]are approximations to $b$ and $g(x)$, respectively. Their accuracy, as $p$ increases, depends on the choice of the sequence $\psi_1,\psi_2,\ldots$
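Continuing the simulation sketch above (reusing `X`, `Y`, `t`, `n` and `dt`), the empirical analogue of minimizing $s_p$ is a least-squares regression of the centred responses on the centred scores $\int_\cI (X_i-\bar X)\psi_j$; the fitted coefficients then give $b_p$ and $g_p$. The first $p$ cosine functions are used as a stand-in for a generic orthonormal basis, purely for illustration.

```python
p = 4

# First p elements of an orthonormal basis psi_1, ..., psi_p (cosines, for illustration)
psi = np.sqrt(2.0) * np.array([np.cos(j * np.pi * t) for j in range(1, p + 1)])  # p x m

# Centred scores \int_I (X_i - Xbar) psi_j, approximated by Riemann sums
Xbar = X.mean(axis=0)
S = (X - Xbar) @ psi.T * dt           # n x p score matrix

# beta_1, ..., beta_p: least-squares minimizer of the sample version of s_p
beta, *_ = np.linalg.lstsq(S, Y - Y.mean(), rcond=None)

# Plug-in approximations b_p and g_p
b_p = beta @ psi                      # discretized b_p on the grid

def g_p(x):
    """Predicted response for a new curve x evaluated on the same grid."""
    return Y.mean() + (x - Xbar) @ b_p * dt

print("in-sample RMSE:", np.sqrt(np.mean((g_p(X) - Y) ** 2)))
```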
Sometimes the basis is chosen independently of the data, e.g. a sine-cosine basis or a spline basis (although for splines the choice of knots may itself depend on the data). A drawback of such fixed bases is that there is no reason why their first $p$ elements should capture the most important information about the regression function $g$.
Principal component basis
One of the most popular adaptive bases is the so-called principal component basis, constructed from the covariance function $K(s, t)=\cov(X(s), X(t))$ of the random process $X$. We also use the notation $K$ for the linear operator that takes a square-integrable function $\psi$ to the function $K(\psi)$ given by $K(\psi)(t)=\int_\cI\psi(s)K(s,t)ds$.
Since $\int_\cI E(X^2) < \infty$, we have $\int_\cI K(t, t)dt < \infty$, and we can write the spectral decomposition of $K$ as
\[K(s, t) = \sum_{k=1}^\infty \theta_k\phi_k(s)\phi_k(t)\,,\]where the principal component basis $\phi_1,\phi_2,\ldots$ is a complete orthonormal sequence of eigenfunctions of the operator $K$, with respective eigenvalues $\theta_1,\theta_2,\ldots$ That is, $K(\phi_k)=\theta_k\phi_k$ for $k \ge 1$.
Since $K$ is a covariance function, it is nonnegative definite, so the eigenvalues are nonnegative; and the condition $\int_\cI E(X^2) < \infty$ entails $\sum_k\theta_k < \infty$. Therefore we can order the terms so that
\[\theta_1 \ge \theta_2 \ge \cdots \ge 0\,.\]The covariance function is estimated by
\[\hat K(s, t) = \frac 1n\sum_{i=1}^n \{X_i(s)-\bar X(s)\}\{X_i(t)-\bar X(t)\}\,,\]where $\bar X(t)=n^{-1}\sum_{i=1}^nX_i(t)$.
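On a grid, the empirical principal component basis comes from the eigendecomposition of $\hat K$; multiplying by the grid spacing turns the matrix eigenproblem into an approximation of the eigenproblem for the integral operator. A minimal sketch, reusing `X`, `n` and `dt` from the simulation above:

```python
# Empirical covariance function \hat K(s, t) on the grid
Xc = X - X.mean(axis=0)
K_hat = Xc.T @ Xc / n                 # m x m; K_hat[j, k] approximates \hat K(t_j, t_k)

# K_hat * dt approximates the integral operator, so its eigenvalues estimate
# theta_1 >= theta_2 >= ... and its rescaled eigenvectors estimate phi_1, phi_2, ...
evals, evecs = np.linalg.eigh(K_hat * dt)
order = np.argsort(evals)[::-1]
theta_hat = evals[order]
phi_hat = evecs[:, order].T / np.sqrt(dt)   # rows have unit L2 norm on the grid

# Sanity checks: trace identity and orthonormality of the leading eigenfunctions
print(np.isclose(theta_hat.sum(), np.trace(K_hat) * dt))
print(np.allclose(phi_hat[:3] @ phi_hat[:3].T * dt, np.eye(3)))
```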
The orthonormal PLS basis
The standard PLS basis, adapted to the functional context, is defined sequentially: at step $p$, $\psi_p$ is chosen to maximize the covariance functional
\[f_p(\psi_p) = \cov\left\{ Y-g_{p-1}(X), \int_\cI X\psi_p \right\}\,,\]where $g_{p-1}$ denotes the approximation defined above, computed from $\psi_1,\ldots,\psi_{p-1}$ (with $g_0\equiv E(Y)$), subject to
\[\int_\cI\int_\cI \psi_j(s)K(s, t)\psi_p(t)dsdt = 0\quad \text{for }1\le j\le p-1\]and
\[\Vert \psi_p\Vert = 1\,.\]
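Since $f_p(\psi) = \int_\cI c_p\psi$ with $c_p(t) = \cov\{Y - g_{p-1}(X), X(t)\}$, the maximizer at each step is the normalized projection of $c_p$ onto the orthogonal complement of $\mathrm{span}\{K\psi_1,\ldots,K\psi_{p-1}\}$. The sketch below implements the sample analogue on a grid, with $g_{p-1}$ updated by least squares on the current scores; the function name and these implementation details are my own reading of the construction, not code from the paper.

```python
import numpy as np

def pls_basis(X, Y, t, p):
    """Sample analogue of the first p PLS basis functions (a sketch, not the paper's code).

    X: n x m matrix of curves on the grid t;  Y: length-n response vector.
    Returns a p x m array whose rows discretize psi_1, ..., psi_p.
    """
    n = X.shape[0]
    dt = t[1] - t[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean()
    K_hat = Xc.T @ Xc / n                       # empirical covariance function on the grid
    psis, scores, constraints = [], [], []
    resid = Yc                                   # residual Y - g_{p-1}(X), starting from g_0 = E(Y)
    for _ in range(p):
        # c(t) = cov{Y - g_{p-1}(X), X(t)}, so that f_p(psi) = \int_I c psi
        c = Xc.T @ resid / n
        # Enforce the K-orthogonality constraints by projecting c onto the orthogonal
        # complement of span{K psi_1, ..., K psi_{p-1}} (kept as an orthonormal set)
        for q in constraints:
            c = c - np.sum(c * q) * dt * q
        psi = c / np.sqrt(np.sum(c ** 2) * dt)   # enforce ||psi_p|| = 1
        psis.append(psi)
        # Store an orthonormalized copy of K psi_p for the next step's constraints
        k = K_hat @ psi * dt
        for q in constraints:
            k = k - np.sum(k * q) * dt * q
        constraints.append(k / np.sqrt(np.sum(k ** 2) * dt))
        # Update g_p: least-squares fit of Y on the scores \int_I (X - Xbar) psi_j
        scores.append(Xc @ psi * dt)
        S = np.column_stack(scores)
        coef, *_ = np.linalg.lstsq(S, Yc, rcond=None)
        resid = Yc - S @ coef
    return np.array(psis)
```

With the simulated data from the first sketch, `psi_hat = pls_basis(X, Y, t, p=4)` returns a $4\times m$ array whose rows discretize $\psi_1,\ldots,\psi_4$.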