
Additive Model with Linear Smoother

Tags: Additive Model, Linear Smoothers

This note is for Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear Smoothers and Additive Models. The Annals of Statistics, 17(2), 453–510. JSTOR.

A number of problems are associated with $p$-dimensional smoothers.

The authors take a different approach and use the one-dimensional smoother as a building block for a restricted class of nonparametric multiple regression models.

The additive model takes the form

\[E(Y_i\mid x_{i1},\ldots, x_{ip}) = \sum_{j=1}^pf_j(x_{ij})\]

The model is a special case of each of the following:

  • PPR (projection pursuit regression)
  • ALS (alternating least squares)
  • ACE (alternating conditional expectation)

Additive Model and Its Normal Equations

Least Squares on Populations

The optimization problem is to minimize

\[E(Y-g(X))^2\]

over $g(X)=\sum_{j=1}^pf_j(X_j)\in \cH^{add}$. Without the additivity restriction, the solution is simply $E(Y\mid X)$.

Since $\cH^{add}$ is assumed to be a closed subspace of $\cH$, this minimum exists and is unique, but the individual functions $f_i(X_i)$ may not be uniquely determined.

The minimizer can be characterized by the residual $Y-g(X)$ being orthogonal to the space of fits: $Y-g(X)\ind \cH^{add}$. Since $\cH^{add}$ is generated by the $\cH_i$, this is equivalent to $Y-g(X)\ind \cH_i$, i.e., $P_i(Y-g(X))=0$, for each $i$. Componentwise this can be written as

\[f_i(X_i) = P_i(Y-\sum_{j\neq i}f_j(X_j))\]
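Stacked, these $p$ componentwise equations form the normal equations in operator-matrix form; in the data version, each conditional expectation operator $P_j$ is replaced by a smoother matrix $S_j$, which gives the system $\hat Pf=\hat Qy$ that appears later in this note:

\[
\begin{bmatrix}
I & P_1 & \cdots & P_1\\
P_2 & I & \cdots & P_2\\
\vdots & \vdots & \ddots & \vdots\\
P_p & P_p & \cdots & I
\end{bmatrix}
\begin{bmatrix}
f_1(X_1)\\ f_2(X_2)\\ \vdots\\ f_p(X_p)
\end{bmatrix}
=
\begin{bmatrix}
P_1Y\\ P_2Y\\ \vdots\\ P_pY
\end{bmatrix}\,,
\qquad
\hat P = \begin{bmatrix}
I & S_1 & \cdots & S_1\\
S_2 & I & \cdots & S_2\\
\vdots & \vdots & \ddots & \vdots\\
S_p & S_p & \cdots & I
\end{bmatrix},\quad
\hat Q = \begin{bmatrix}
S_1\\ S_2\\ \vdots\\ S_p
\end{bmatrix}\,.
\]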

Penalized Least Squares

For a single smoother, the penalized least squares criterion is

\[\Vert y-f\Vert^2 + \lambda f^TKf\]

Assuming the inverses exist, the stationary condition implies

\[f = (I+\lambda K)^{-1}y\]

that is

\[S = (I+\lambda K)^{-1}\]

Then, $f=Sy$ can be characterized as a stationary point of

\[\Vert y - f\Vert^2 + f^T(S^{-1}-I)f\]
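To see why the two criteria share the same stationary point (assuming $S$ and $I+\lambda K$ are invertible), set the gradient of the first one to zero,

\[-2(y-f) + 2\lambda Kf = 0 \iff (I+\lambda K)f = y \iff f = Sy\,;\]

the second criterion has $\lambda K$ replaced by $S^{-1}-I$, so its stationarity condition reads $S^{-1}f=y$, again giving $f=Sy$.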

Extend to additive regression by penalizing the RSS separately for each component function,

\[Q(f) = \Vert y - \sum_{j=1}^pf_j\Vert^2 + \sum_{j=1}^pf_j^T(S_j^{-1}-I)f_j\]
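Differentiating $Q$ with respect to each $f_j$ (assuming $S_j^{-1}$ exists) recovers the componentwise normal equations with $P_j$ replaced by $S_j$,

\[-\Big(y - \sum_{k=1}^pf_k\Big) + (S_j^{-1}-I)f_j = 0 \iff f_j = S_j\Big(y-\sum_{k\neq j}f_k\Big)\,,\quad j=1,\ldots,p\,,\]

that is, the estimating equations $\hat Pf=\hat Qy$ written out above.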

A proof of the existence of a solution for cubic smoothing splines is given in the appendix of the paper.

Algorithms for solving the normal equations

The Gauss-Seidel method is only one technique in the large class of iterative schemes called successive over-relaxation (SOR) methods, which differ from ordinary Gauss-Seidel procedures in how far one proceeds in the direction of the Gauss-Seidel update:

\[f_j\leftarrow (1-\omega)f_j + \omega S_j\left(y-\sum_{k\neq j}f_k\right)\]

If the Gauss-Seidel algorithm converges, so do successive over-relaxation iterations for relaxation parameters $0 < \omega < 2$.
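Here is a minimal sketch of this relaxed update, assuming each $S_j$ is available as an explicit $n\times n$ smoother matrix (the function name and array layout are my own):

```python
import numpy as np

def backfit_sor(y, smoothers, omega=1.0, n_iter=100, tol=1e-8):
    """Gauss-Seidel backfitting with relaxation parameter omega.

    omega = 1 is ordinary backfitting; 0 < omega < 2 gives successive
    over-relaxation. `smoothers` is a list of n x n smoother matrices S_j.
    """
    n, p = len(y), len(smoothers)
    f = np.zeros((p, n))                              # component fits f_j
    for _ in range(n_iter):
        f_old = f.copy()
        for j, S in enumerate(smoothers):
            partial_res = y - f.sum(axis=0) + f[j]    # y - sum_{k != j} f_k
            f[j] = (1 - omega) * f[j] + omega * (S @ partial_res)
        if np.max(np.abs(f - f_old)) < tol:           # stop when fits stabilize
            break
    return f
```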

The numerical analysis literature also distinguishes between successive and simultaneous iterations, referred to as

  • Gauss-Seidel (successive: each $f_j$ is updated using the most recent values of the other components)
  • Jacobi (simultaneous: all components are updated from the previous sweep)

Summary of Consistency, Degeneracy, and Convergence Results

It is not a priori clear when the normal equations are consistent (when solutions exist). Nor is it clear when the equations are nondegenerate (when the solutions are unique).

Nondegeneracy implies consistency; however, the normal equations are almost always degenerate.

  1. For symmetric smoothers with eigenvalues in [0, 1], the normal equations always have at least one solution
  2. The solution is unique unless there exists a $g\neq 0$ such that $\hat Pg = 0$, a phenomenon we call “concurvity”. This implies for any solution $f$, $f+\alpha g$ is also a solution for any $\alpha$

Consistency

If each $S_j$ is symmetric with eigenvalues in $[0, 1]$, the normal equations are consistent for every $y$

If the smoothers $S_j$ are symmetric with eigenvalues in $[0, 1)$, the solutions of the normal equations can be written as $f_j = A_j(I+A)^{-1}y$, where $A_j = (I-S_j)^{-1}S_j$ and $A = \sum_jA_j$.
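As a quick sanity check of this formula (my own sketch, not from the paper), one can verify numerically that $f_j=A_j(I+A)^{-1}y$ satisfies the normal equations $f_j=S_j(y-\sum_{k\neq j}f_k)$ for random symmetric smoothers with eigenvalues in $[0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
y = rng.standard_normal(n)

def random_smoother(n):
    """Random symmetric matrix with eigenvalues in [0, 1)."""
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ np.diag(rng.uniform(0.0, 0.99, n)) @ Q.T

S = [random_smoother(n) for _ in range(p)]
A_j = [np.linalg.solve(np.eye(n) - Sj, Sj) for Sj in S]   # (I - S_j)^{-1} S_j
A = sum(A_j)
f = [Aj @ np.linalg.solve(np.eye(n) + A, y) for Aj in A_j]

# Each f_j should satisfy f_j = S_j (y - sum_{k != j} f_k).
for j in range(p):
    gap = f[j] - S[j] @ (y - (sum(f) - f[j]))
    print(np.max(np.abs(gap)))    # all close to machine precision
```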

Degeneracy of smoother-based normal equations

Collinearity detection as part of regression diagnostics is a must in every good regression analysis. Practitioners are usually concerned with approximate collinearity and its inflationary effects on the standard errors of regression coefficients.

The term “collinearity” refers to linear dependencies among predictors as the cause of degeneracy, while the term “concurvity” has been used to describe nonlinear dependencies which lead to degeneracy in additive models.

In a technical sense, concurvity boils down to collinearity of nonlinear transforms of predictors. It is more intuitive to think of it as an additive dependence $f_+=0$.

Exact concurvity is defined as the existence of a nonzero solution of the corresponding homogeneous equations

\[\hat Pg = 0\]

If such a $g$ exists, and if $f$ is a solution to $\hat Pf=\hat Qy$, then so is $f+\omega g$ for any $\omega$, and thus infinitely many solutions exist.

The set of all nonzero solutions to the homogeneous equations $\hat Pg=0$ is called the concurvity space of the normal equations.

It is easy to check that

\[g = \begin{bmatrix} \alpha 1\\ -\alpha 1 \end{bmatrix}\]

lies in the concurvity space of the two-smoother problem whenever both smoothers reproduce constants. Similarly, for $p$ such smoothers the concurvity space has dimension at least $p-1$, as the display below shows.
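More explicitly, if every smoother reproduces constants ($S_j1=1$), then any vector of constant components that sum to zero satisfies $\hat Pg=0$,

\[g = \begin{bmatrix} \alpha_1 1\\ \vdots \\ \alpha_p 1\end{bmatrix}\,,\qquad \sum_{j=1}^p\alpha_j = 0\,,\]

and these vectors form a $(p-1)$-dimensional subspace of the concurvity space.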

Consider the penalized criterion with $y=0$,

\[Q(g) = \Vert g_+\Vert^2 + \sum_{j=1}^p g_j^T(S_j^{-1}-I)g_j\,,\]

defined for $g_j\in\cR(S_j)$.

If the smoothers $S_j$ are all symmetric with eigenvalues in $[0,1]$, a vector $g\neq 0$ with $g_j\in \cR(S_j)$ represents a concurvity ($\hat Pg=0$) iff one of the following equivalent conditions is satisfied

  1. $Q(g) = 0$, that is, $g$ minimizes $Q$
  2. $g_j\in \cM_1(S_j),j=1,\ldots,p$, and $g_+=0$.

Condition 2 implies that exact concurvity is exact collinearity if, for example, all smoothers are cubic spline smoothers.

Remark: if $S_j,j=1,\ldots,p$ are symmetric with eigenvalues in $[0, 1)$, then $\hat P$ is nonsingular.

In practice, we separate the constant term in the additive model, and adjust each of the smooth terms to have mean 0.

For two smoothers, there exists exact concurvity iff $f_1=(S_1S_2)f_1$ for some $f_1\neq 0$

If $\Vert S_1S_2\Vert < 1$ for some matrix norm, concurvity does not exist.

For two symmetric smoothers with eigenvalues in $(-1, +1]$, concurvity exists iff $\cM_1(S_1)\cap \cM_1(S_2)\neq 0$.

It again has the consequence that exact concurvity, e.g., for a pair of cubic spline smoothers, can only be an exact collinearity between the untransformed predictors: cubic spline smoothers preserve constant and linear fits, so $\cM_1(S_j)$ is spanned by the constant vector and $x_j$; once the constant term is separated out (as above), a nonzero intersection $\cM_1(S_1)\cap\cM_1(S_2)$ forces an exact linear relation between $x_1$ and $x_2$.

Uniqueness of the additive minimizer is therefore guaranteed if there is no exact collinearity.

The Convergence of backfitting: $p$ smoothers

If all the smoothers $S_j$ are symmetric with eigenvalues in $[0, 1]$, then the backfitting algorithm converges to some solution of the normal equations.

Convergence of backfitting for two smoothers

Decompose $S_1=\tilde S_1+H_U$ and $S_2=\tilde S_2+H_U$, where $H_U$ is the orthogonal projection onto $U = \cM_1(S_1)\cap \cM_1(S_2)$. We have $\tilde S_jH_U=H_U\tilde S_j=0$ and $\Vert \tilde S_1\tilde S_2\Vert_2 < 1$.

The proof considers first $y$ and $f_2^0$ in $U^\ind$, and then $y$ and $f_2^0$ in $U$.

If $S_1$ and $S_2$ are symmetric with eigenvalues in $(-1, 1]$, then the Gauss-Seidel algorithm converges to a solution of the normal equations

The components $f_1^\infty$ and $f_2^\infty$ can be decomposed into

  • the part within $U^\ind$ which is uniquely determined and depends on the data $y$ only
  • the part within $U$, which depends on the iteration sequence and the initialization $f_2^0$

Modified Backfitting Algorithm

  1. Initialize $\tilde f_1,\ldots,\tilde f_p$, and set $\tilde f_+ = \tilde f_1+\tilde f_2+\cdots+\tilde f_p$
  2. Regress $y-\tilde f_+$ onto the space spanned by $\cM_1(S_1),\ldots,\cM_1(S_p)$, that is, set $g=H(y-\tilde f_+)$
  3. Fit an additive model to $y-g$ using smoothers $\tilde S_i$; this step yields an additive fit $\tilde f_+=\tilde f_1+\cdots +\tilde f_p$
  4. Repeat steps 2 and 3 until convergence (a sketch follows below).
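Here is a sketch of these steps, under the illustrative assumption that each $S_j$ is a cubic-spline-type smoother, so that $\cM_1(S_j)$ is spanned by $1$ and $x_j$; then $H$ is the least-squares projection onto $(1,x_1,\ldots,x_p)$ and $\tilde S_j=S_j-H_j$ with $H_j$ the projection onto $(1,x_j)$. All names are my own, and the inner additive fit is reduced to a single backfitting sweep to keep the sketch short:

```python
import numpy as np

def proj(X):
    """Orthogonal projection matrix onto the column space of X."""
    Q, _ = np.linalg.qr(X)
    return Q @ Q.T

def modified_backfit(y, x_cols, smoothers, n_iter=50):
    """Sketch of modified backfitting with M_1(S_j) = span{1, x_j}."""
    n, p = len(y), len(smoothers)
    one = np.ones((n, 1))
    H = proj(np.hstack([one] + [x.reshape(-1, 1) for x in x_cols]))
    S_tilde = [S - proj(np.hstack([one, x.reshape(-1, 1)]))
               for S, x in zip(smoothers, x_cols)]

    f = np.zeros((p, n))                      # smooth (non-projection) parts
    g = np.zeros(n)                           # projection part
    for _ in range(n_iter):
        g = H @ (y - f.sum(axis=0))           # step 2: project the residual
        for j in range(p):                    # step 3: backfit y - g with S_tilde
            f[j] = S_tilde[j] @ (y - g - (f.sum(axis=0) - f[j]))
    return g + f.sum(axis=0)                  # overall additive fit
```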

A closer look at convergence

Consider the case of extreme collinearity: two identical covariates, each smoothed with the same cubic spline smoother matrix $S$ (a numerical sketch follows the list below).

Starting the backfitting algorithm from $0$, the residual after $m$ smooths is given by

\[r^m = [I-S+S^2-S^3+\cdots + (-1)^{m-1}S^{m-1}](I-S)y\rightarrow (I+S)^{-1}(I-S)y\]
  1. the residuals (and their norm) oscillate as they converge
  2. the converged model is rougher than a single smoother.
  3. by looking at every other iteration, it is clear that the norm of the residuals converges upward to its limit along the even-numbered steps
  4. $r^2$ is the same as the “twicing” residual, where twicing enhances a smooth by adding in the smooth of the residual.
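The oscillation and the limiting residual can be seen in a small numerical sketch; a symmetric difference-penalty smoother $S=(I+\lambda K)^{-1}$ is used below as a stand-in for a cubic spline smoother matrix (my own illustration, not the paper's setup):

```python
import numpy as np

n, lam = 50, 5.0
rng = np.random.default_rng(1)
D = np.diff(np.eye(n), 2, axis=0)              # second-difference operator
K = D.T @ D                                    # roughness penalty matrix
S = np.linalg.inv(np.eye(n) + lam * K)         # symmetric smoother matrix

x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

# Backfitting with two identical covariates, both using the same smoother S.
f1, f2, norms = np.zeros(n), np.zeros(n), []
for m in range(1, 21):
    if m % 2 == 1:
        f1 = S @ (y - f2)
    else:
        f2 = S @ (y - f1)
    norms.append(np.linalg.norm(y - f1 - f2))  # ||r^m||

limit = np.linalg.norm(np.linalg.inv(np.eye(n) + S) @ (np.eye(n) - S) @ y)
print(np.round(norms, 4))                      # oscillates while approaching the limit
print(round(limit, 4))                         # ||(I + S)^{-1}(I - S) y||
```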

Published in categories Note