# Robust Forecasting by Functional Principal Component Analysis


This post is based on Hyndman, R. J., & Shahid Ullah, Md. (2007). Robust forecasting of mortality and fertility rates: A functional data approach. Computational Statistics & Data Analysis, 51(10), 4942–4956.

The paper proposes a method for forecasting age-specific mortality and fertility rates observed over time. The method:

- allows for a smooth function of age
- is robust to outlying years caused by wars and epidemics
- provides a modelling framework that is easily adapted to allow for constraints and other information
- is a generalization of the Lee-Carter (LC) model commonly used in mortality and fertility forecasting.

Model:

$$y_t(x_i) = f_t(x_i) + \sigma_t(x_i)\varepsilon_{t,i},$$

where

- $y_t(x)$ denotes the log of the observed mortality or fertility rate for age $x$ in year $t$.
- $f_t(x)$ is the assumed underlying smooth function
- $\varepsilon_{t,i}$ is an iid standard normal random variable
- $\sigma_t(x_i)$ allows the amount of noise to vary with $x$

Note that **the data are not directly of a functional nature, but that it is assumed that there are underlying functional time series which are observed with error at discrete points**.

- the functional data paradigm leads to the use of nonparametric smoothing to reduce some of the inherent randomness in the observed data
- a robust version of principal components is used to avoid difficulties with outlying years.

Decompose the fitted curves via a basis function expansion,

$$\hat f_t(x) = \hat\mu(x) + \sum_{k=1}^K \beta_{t,k}\,\phi_k(x) + e_t(x),$$

where $\hat\mu(x)$ is a measure of location of the curves, $\{\phi_k(x)\}$ is a set of orthonormal basis functions, and $e_t(x)$ is the residual curve.

Due to the outlying years, robust estimation is used for the age functions $\mu(x)$ and $\phi_k(x)$, but not for the coefficients $\{\beta_{t,k}\}$, so that any outlying years show up as outliers in the coefficient time series, where they can be handled by the time series models.

## Constrained and weighted smoothing

Let $m_t(x)$ be the observed mortality rate and $N_t(x)$ the total population of age $x$ in year $t$. Then $m_t(x)$ is approximately binomially distributed with estimated variance $N_t^{-1}(x)m_t(x)[1-m_t(x)]$, so by the delta method the variance of $y_t(x) = \log[m_t(x)]$ is approximately

$$\hat\sigma_t^2(x) \approx \frac{1-m_t(x)}{N_t(x)\,m_t(x)}.$$

Define weights equal to the inverse variances, $w_t(x) = 1/\hat\sigma_t^2(x)$,

and use weighted penalized regression splines to estimate the curve $f_t(x)$ in each year.
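As a rough illustration (not the paper's implementation), the weighted smoothing step can be sketched with `scipy`'s weighted smoothing spline standing in for constrained penalized regression splines; the mortality schedule, population size, and smoothing parameter below are all made up:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Hypothetical single-year data: ages, true rates, and simulated death counts.
ages = np.arange(100)
m_true = np.exp(-4 + 0.05 * ages) / 50 + 0.001   # toy mortality schedule
N = 10_000                                        # population at each age (made up)
deaths = rng.binomial(N, m_true)
m_obs = np.maximum(deaths, 1) / N                 # observed rates, zeros guarded
y = np.log(m_obs)

# Delta-method variance of y_t(x) = log m_t(x): (1 - m) / (N m).
var_y = (1 - m_obs) / (N * m_obs)

# UnivariateSpline expects weights ~ 1/std; s ~ n is the usual choice then.
f_hat = UnivariateSpline(ages, y, w=1 / np.sqrt(var_y), s=len(ages))
```

Each year's curve is smoothed separately in this way, with its own weights.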

## Robust functional principal components

To obtain a robust estimate of $\mu(x)$, use the $L_1$-median of the estimated smooth curves $\{\hat f_1(x),\ldots,\hat f_n(x)\}$, given by

$$\hat\mu(x) = \operatorname*{argmin}_{\theta(x)}\sum_{t=1}^n \big\Vert \hat f_t(x) - \theta(x)\big\Vert,$$

where $\Vert g(x)\Vert = \left(\int g^2(x)\,dx\right)^{1/2}$ denotes the norm of the function $g$.

> The **geometric median** of a discrete set of sample points in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances for one-dimensional data, and provides a measure of central tendency in higher dimensions. It is also known as the *1-median* or *spatial median*. (Source: Geometric median – Wikipedia)

The median-adjusted data is denoted by $\hat f_t^*(x) = \hat f_t(x) - \hat\mu(x)$.
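On a discrete age grid, the $L_1$-median can be computed with Weiszfeld's iterative reweighting; a minimal sketch (the function name and tolerances are my own):

```python
import numpy as np

def l1_median(F, tol=1e-8, max_iter=500):
    """Geometric (L1) median of curves stored as rows of F (n_curves x n_grid),
    computed with Weiszfeld's iteratively reweighted averaging."""
    mu = F.mean(axis=0)                       # start from the pointwise mean
    for _ in range(max_iter):
        d = np.linalg.norm(F - mu, axis=1)    # distance of each curve to mu
        d = np.where(d < 1e-12, 1e-12, d)     # guard against division by zero
        w = 1.0 / d
        mu_new = (w[:, None] * F).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            return mu_new
        mu = mu_new
    return mu
```

Unlike the pointwise mean, this estimate is not dragged toward a few outlying curves, which is exactly why it is used for $\hat\mu(x)$.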

The paper introduces two methods for obtaining robust principal components of $\{\hat f_t^*(x)\}$: one uses a weighted approach and the other is based on a projection pursuit algorithm. The two are then combined into an efficient and robust method for obtaining the basis functions $\{\phi_k(x)\}$.

### Weighted principal components

Aim to find the functions $\phi_k(x)$ such that the variance of the scores

$$z_{t,k} = \int \phi_k(x)\,\hat f_t^*(x)\,dx$$

is maximized subject to the orthonormality constraints $\int \phi_k^2(x)\,dx = 1$ and $\int \phi_k(x)\phi_l(x)\,dx = 0$ for $l < k$.

The weights $w_t$ are chosen so that outlying observations receive low weight.
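A minimal sketch of the idea on a discrete grid, with an ordinary eigendecomposition of a weighted covariance matrix standing in for the paper's exact procedure (the function name and the way weights enter are illustrative simplifications):

```python
import numpy as np

def weighted_pcs(Fstar, w, K=2):
    """Principal components of median-adjusted curves Fstar (n x p),
    with per-year weights w downweighting outlying years."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    # Weighted covariance of the curves (rows already centred at the L1-median).
    C = (Fstar * w[:, None]).T @ Fstar
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    phis = eigvecs[:, order[:K]].T            # basis functions phi_k on the grid
    betas = Fstar @ phis.T                    # scores beta_{t,k}
    return phis, betas
```

With equal weights this reduces to ordinary principal components; giving outlying years small $w_t$ keeps them from distorting the estimated $\phi_k(x)$.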

### Principal components by projection pursuit

Measure the dispersion of the scores by a function $S(z_{1,k},\ldots,z_{n,k})$. If $S$ is taken to be the sample variance, this is identical to the definition of principal components; other measures of dispersion lead to alternative decompositions analogous to the classical procedure. The RAPCA algorithm uses the first quartile of the pairwise differences as the measure of dispersion:

$$S(z_1,\ldots,z_n) = c_n\,\big\{|z_i - z_j|\,;\, i<j\big\}_{(\tau)},$$

where $\{\cdot\}_{(\tau)}$ denotes the $\tau$-th order statistic, $\tau = \binom{\lfloor n/2 \rfloor + 1}{2} \approx \binom{n}{2}/4$, and $c_n$ is a small-sample correction factor to make $S$ an unbiased estimator.
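The dispersion measure itself is easy to sketch on a discrete score vector; here the correction factor `c_n` is left at 1 rather than set to the unbiasing value from the source:

```python
import numpy as np
from math import comb, floor

def pairwise_quartile_scale(z, c_n=1.0):
    """Robust dispersion: the tau-th smallest pairwise absolute difference,
    tau = C(floor(n/2) + 1, 2), scaled by a correction factor c_n."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    diffs = np.abs(z[:, None] - z[None, :])
    pairs = diffs[np.triu_indices(n, k=1)]   # |z_i - z_j| for i < j
    tau = comb(floor(n / 2) + 1, 2)
    return c_n * np.sort(pairs)[tau - 1]     # tau-th order statistic (1-indexed)
```

Because the statistic is an order statistic of pairwise differences rather than a sum of squares, a few wildly outlying scores barely move it.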

### Two-step algorithm for functional principal components

The two-step algorithm combines the two approaches: RAPCA is first used to identify outlying curves and assign the weights $w_t$, which are then used in the weighted principal component decomposition.

## Forecasting

Let $e_{n,h}(x) = y_{n+h}(x) - \hat y_{n,h}(x)$ denote the forecast error. Then the integrated squared forecast error is defined as

$$\mathrm{ISFE}_n(h) = \int e_{n,h}^2(x)\,dx.$$

Choose the order $K$ such that the ISFE on a rolling hold-out sample is minimized. That is, fit the model to data up to time $t$ and predict the next $m$ periods to obtain $\mathrm{ISFE}_t(h)$, $h=1,\ldots,m$. Then choose $K$ to minimize $\sum_{h=1}^m\sum_{t=N}^{n-h}\mathrm{ISFE}_t(h)$, where $N$ is the minimum number of observations used to fit the model.
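A rough sketch of the rolling hold-out for a given $K$, with deliberate simplifications: a plain SVD stands in for the robust two-step decomposition, the pointwise median for the $L_1$-median, and a random-walk forecast of the scores for the paper's fitted time series models:

```python
import numpy as np

def isfe_for_K(Y, K, N_min, m):
    """Sum of integrated squared forecast errors over a rolling hold-out.
    Y is an (n_years x n_ages) matrix of smoothed log rates on a grid."""
    n, p = Y.shape
    total = 0.0
    for t in range(N_min, n - 1):
        train = Y[:t]
        mu = np.median(train, axis=0)            # robust centre (stand-in for L1-median)
        Fc = train - mu
        _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
        phi = Vt[:K]                             # first K basis functions on the grid
        beta = Fc @ phi.T                        # scores beta_{t,k}
        beta_fc = beta[-1]                       # random-walk forecast: last scores
        for h in range(1, min(m, n - t) + 1):
            y_hat = mu + beta_fc @ phi
            total += np.mean((Y[t + h - 1] - y_hat) ** 2)  # ISFE approximated on grid
    return total

# Choose K minimizing the accumulated ISFE over some candidate orders.
```

In practice one would evaluate `isfe_for_K` for each candidate order and keep the minimizer.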

## Extensions

Suppose we observe $y_{t,j}(x)$ for each of $j=1,\ldots,M$ groups. Then the following models are of interest:

\begin{align}
y_{t,j}(x) &= \mu(x) + \sum_{k=1}^K\beta_{t,k}\phi_k(x) + e_{t,j}(x)\\
y_{t,j}(x) &= \mu_j(x) + \sum_{k=1}^K\beta_{t,k}\phi_k(x) + e_{t,j}(x)\\
y_{t,j}(x) &= \mu_j(x) + \sum_{k=1}^K\beta_{t,j,k}\phi_k(x) + e_{t,j}(x)\\
y_{t,j}(x) &= \mu_j(x) + \sum_{k=1}^K\beta_{t,j,k}\phi_{k,j}(x) + e_{t,j}(x)
\end{align}

These represent models with successively more degrees of freedom:

- the 1st one assumes no difference between the groups
- the 2nd one assumes the groups differ by an amount depending only on $x$
- the 3rd one assumes the same basis functions apply to all groups, but the coefficients differ between groups
- the 4th one allows completely different models for each group.

More complicated variations are also possible.