# Principal Curves

##### Posted on Sep 28, 2020

This post is mainly based on Hastie, T., & Stuetzle, W. (1989). Principal Curves. Journal of the American Statistical Association.

Principal curves are smooth one-dimensional curves that pass through the middle of a $p$-dimensional data set, providing a nonlinear summary of the data.

Type of summary for the pattern exhibited by the points on two variables:

• a trivial summary: the mean vector that simply locates the center of the cloud but conveys no information about the joint behavior of the two variables
• treat one of the variables as a response variable and the other as an explanatory variable.
• in many situations we do not have a preferred variable that we wish to label “response”, but would still like to summarize the joint behavior of $x$ and $y$
• an obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically, and the first PC line is found by minimizing the orthogonal deviations

linear regression -> nonlinear function of $x$; similarly, a straight line -> smooth curve

Formally, define principal curves to be those smooth curves that are self-consistent for a distribution or data set. This means that if we pick any point on the curve, collect all of the data that project onto this point, and average them, then this average coincides with the point on the curve.

The largest PC line can be used in several situations

• errors-in-variables regression
• replace several highly correlated variables with a single variable
• factor analysis

and in all the previous situations the model can be written as

$\bfx_i = \bfu_0 + \bfa\lambda_i + \bfe_i\,,$

where $\bfu_0 + \bfa\lambda_i$ is the systematic component and $\bfe_i$ is the random component.

A natural generalization is the nonlinear model,

$\bfx_i = \bff(\lambda_i) + \bfe_i\,.$

A one-dimensional curve in $p$-dimensional space is a vector $\bff(\lambda)$ of $p$ functions of a single variable $\lambda$. These functions are called the coordinate functions, and $\lambda$ provides an ordering along the curve.

The vector $\bff’(\lambda)$ is tangent to the curve at $\lambda$ and is sometimes called the velocity vector at $\lambda$. A curve with $\Vert \bff’\Vert = 1$ is called a unit-speed parameterized curve. We can always reparameterize any smooth curve with $\Vert \bff’\Vert > 0$ to make it unit speed. Such a curve is also called parameterized by arc length since the arc length equals

$\int_0^\lambda\Vert \bff'(t)\Vert dt = \int_0^\lambda dt=\lambda\,.$

The vector $\bff’’(\lambda)$ is called the acceleration of the curve at $\lambda$, and for a unit-speed curve it is easy to check that it is orthogonal to the tangent vector. (TODO: to check) Here is an explanation from classical mechanics,

The acceleration is also a vector quantity typically decomposed into two components, one parallel to the speed $\bff’(\lambda)$, called the tangent acceleration, and the other a component normal to the speed, the so-called radial acceleration. Since the tangent speed is always unitary, then the tangent acceleration is zero, and hence $\bff’’(\lambda)$ is normal to the curve (i.e., normal to the tangent of the curve).  There is a unique unit-speed circle in the plane that goes through $\bff(\lambda)$ and has the same velocity and acceleration at $\bff(\lambda)$ as the curve itself. The radius $r_\bff(\lambda)$ of this circle is called the radius of curvature of the curve $f$ at $\lambda$, and $r_\bff(\lambda) = 1/\Vert \bff’’(\lambda)\Vert$.

Denote by $\bfX$ a random vector in $\IR^p$ with density $h$ and finite second moments. Without loss of generality, assume $\E(\bfX) = 0$. Let $f$ denote a smooth $C^\infty$ unit-speed curve in $\IR^p$ parameterized over $\Lambda \subset \IR^1$, a closed (possibly infinite) interval, that does not intersect itself ($\lambda_1\neq \lambda_2\Rightarrow f(\lambda_1)\neq f(\lambda_2)$) and has finite length inside any finite ball in $\IR^p$.

Define the projection index $\lambda_\bff: \IR^p\rightarrow \IR^1$ as

$\lambda_f(x) = \sup_\lambda\{\lambda:\Vert \bfx-\bff(\lambda)\Vert = \inf_\mu \Vert \bfx-\bff(\mu)\Vert\}$

The curve $\bff$ is called self-consistent or a principal curve of $h$ if $E(\bfX\mid \lambda_\bff(X)=\lambda) = \bff(\lambda)$ for a.e. $\lambda$.

Several interesting questions:

• for what kinds of distributions do principal exist
• how many different principal curves are there for a given distribution
• and what are their properties

The principal curve is biased for the functional model $\bfX=\bff(\lambda)+\bepsilon$ with $\bff$ smooth and $E(\bepsilon) = 0$ since $\bff$ generally seems not to be a principal curve.

Connections between principal curves and principal components

• If a straight line is self-consistent, then it is a principal component
• principal components need not be self-consistent in the sense of the definition, but they are self-consistent with respect to linear regression.

A distance property of principal curves

• principal components are critical points of the distance from the observation in terms of straight line perturbations
• the principal curve is a critical point in terms of smooth curves perturbations.

Other approaches to fit nonlinear one-dimensional manifolds

• Carroll (1969): fit a model of the form $x_i = p(\lambda_i) + e_i$, where $p(\lambda)$ is a vector of polynomials functions, and the coefficients are found by linear least squares
• Etezadi-Amoli and McDonald (1983): same as Carroll’s, but use different goodness-of-fit measure
• Shepard and Carroll (1966): proceed from the assumption that the $p$-dimensional observation vectors lie exactly on a smooth one-dimensional manifold.

## Viewpoint from Mixture Model

Tibshirani, R. (1992) viewed the principal curve problem in terms of a mixture model.

Let $\bfY = (Y_1,\ldots, Y_p)$ be a random vector with density $g_\bfY(y)$. Imagine that each $\bfY$ value was generated in two stages:

1. generate a latent variable $S$ according to some distribution $g_S(s)$
2. generate $\bfY = (Y_1,\ldots, Y_s)$ from a conditional distribution $g_{\bfY\mid s}$ having mean $\bff(S)$, with $Y_1,\ldots, Y_p$ conditionally independent given $s$

Hence, define a principal curve of $g_\bfY$ to be a triplet ${g_S, g_{\bfY\mid S}, \bff}$ satisfying the following conditions

• $g_S(s)$ and $g_{\bfY\mid s}(y\mid s)$ are consistent with $g_\bfY(y)$, that is, $g_\bfY(y)=\int g_{\bfY\mid s}g_S(s)ds$
• $Y_1,\ldots,Y_p$ are conditionally independent given $s$
• $\bff(s)$ is a curve in $\IR^p$ parameterized over $s\in\Gamma$, a closed interval in $\IR^1$, satisfying $\bff(s)=\E(\bfY\mid S=s)$.

## Comparison with Parametric spline curves

Note that the order of points are not needed, and it is determined in the fitting procedure. However, the parametric spline curves require the order of points.

A parametric spline curve in $\IR^s$ is a spline function where each $B$-spline coefficient is a point in $\IR^s$. More specifically, let $\bft = (t_i)_{i=1}^{n+d+1}$ be a knot vector for splines of degree $d$. Then a parametric spline curve of degree $d$ with knot vector $\bft$ and coefficients $\bfc=(\bfc_i)_{i=1}^n$ is given by

$\bfg(u) = \sum_{i=1}^n\bfc_iB_{i,d,\bft}(u)$

where each $\bfc_i=(c_i^1,\ldots, c_i^s)$ is a vector in $\IR^s$.

## With a Length Constraint

Kegl, B., Krzyzak, A., Linder, T., & Zeger, K. (2000) proposed a new definition.

Let the expected squared distance between $\bfX$ and $\bff$ be denoted by

$\Delta(\bff) = E[\inf_\lambda\Vert\bfX - \bff(\lambda) \Vert] = E\Vert \bfX - \bff(t_\bff(\bfX))\Vert^2\,.$

A curve $\bff^*$ is called a principal curve of length $L$ for $\bfX$ if $\bff^*$ minimizes $\Delta(\bff)$ over all curves of length less than or equal to $L$.

Neither the HS nor this definition guarantees the uniqueness of principal curves.

Consider classes $\cS_1,\cS_2$, of curves of increasing complexity. For $k\ge 1$, let $\cS_k$ be the set of polygonal (piecewise linear) curves which have $k$ segments and whose lengths do not exceed $L$.

The empirical squared error is

$\Delta_n(\bff) = \frac 1n\sum_{i=1}^n\Delta(\bfX_i, \bff)$

and let

$\bff_{k, n} = \argmin_{\bff\in \cS_k} \Delta_n(\bff)$

In practice, propose a suboptimal method with reasonable complexity which also picks the length $L$ of the principal curve automatically. Also starts from the first PC, and then increase the number of segments by one by adding a new vertex to the polygonal line produced in the previous iteration.

## Reference

Published in categories Note