WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Gaussian Processes for Regression

Tags: Gaussian Processes

This note is for Chapters 2 (Regression) and 4 (Covariance Functions) of Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.

Given a dataset $\cD$ of $n$ observations, $\cD=\{(\x_i,y_i)\mid i=1,\ldots,n\}$, the goal is to move from the finite training data $\cD$ to a function $f$ that makes predictions for all possible input values.

Two common approaches:

  • restrict the class of functions, such as only considering linear functions of the input
  • give a prior probability to every possible function, where higher probabilities are given to functions that we consider to be more likely, for example because they are smoother than other functions.

A Gaussian process is a generalization of the Gaussian probability distribution.

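There are several ways to interpret Gaussian process (GP) regression models:

  • function-space view: think of a GP as defining a distribution over functions, with inference taking place directly in the space of functions
  • weight-space view: treat the model as Bayesian linear regression on the weights, possibly after projecting the inputs into a feature space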

Weight-space view

The Standard Linear Model

\[f(x) = x^Tw, \qquad y=f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma_n^2)\]

Put a zero-mean Gaussian prior with covariance matrix $\Sigma_p$ on the weights:

\[w\sim N(0, \Sigma_p)\]

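Combining this prior with the Gaussian likelihood gives the posterior

\[w\mid X, y \sim N(\bar w, A^{-1}), \qquad \bar w = \sigma_n^{-2}A^{-1}Xy, \qquad A = \sigma_n^{-2}XX^T + \Sigma_p^{-1}\]

where $X$ is the matrix whose columns are the training inputs. The posterior mean $\bar w$ is also its mode, which is called the maximum a posteriori (MAP) estimate of $w$.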

In a non-Bayesian setting, the negative log prior is sometimes thought of as a penalty term, and the MAP point is known as the penalized maximum likelihood estimate of the weights.

To make predictions for a test case, we average over all possible parameter values, weighted by their posterior probability. Thus the predictive distribution for $f_\star$ at $x_\star$ is given by averaging the output of all possible linear models w.r.t. the Gaussian posterior,

\[p(f_\star\mid x_\star,X,y) = \int p(f_\star\mid x_\star,w)p(w\mid X, y)dw\]
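For the linear model this integral is Gaussian in closed form; in the book it is $N(x_\star^T\bar w,\ x_\star^TA^{-1}x_\star)$. The following is a minimal NumPy sketch of that computation, not a reference implementation. The function names `blr_posterior` and `blr_predict`, the synthetic data, and the choice $\Sigma_p = I$ are assumptions made only for illustration.

```python
import numpy as np

def blr_posterior(X, y, Sigma_p, sigma_n):
    """Posterior N(w_bar, A^{-1}) of the weights for y = X^T w + noise.

    X has shape (D, n): each column is one training input.
    A     = X X^T / sigma_n^2 + Sigma_p^{-1}
    w_bar = A^{-1} X y / sigma_n^2
    """
    A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    w_bar = np.linalg.solve(A, X @ y) / sigma_n**2
    return w_bar, A

def blr_predict(x_star, w_bar, A):
    """Mean and variance of f_* = x_*^T w under the weight posterior."""
    mean = x_star @ w_bar
    var = x_star @ np.linalg.solve(A, x_star)
    return mean, var

# Synthetic data (illustrative assumption): 2-D inputs, known true weights.
rng = np.random.default_rng(0)
D, n, sigma_n = 2, 200, 0.1
w_true = np.array([1.5, -0.5])
X = rng.normal(size=(D, n))
y = X.T @ w_true + sigma_n * rng.normal(size=n)

w_bar, A = blr_posterior(X, y, np.eye(D), sigma_n)
mean, var = blr_predict(np.array([1.0, 2.0]), w_bar, A)
print(w_bar, mean, var)
```

The predictive variance here is for the noise-free $f_\star$; a prediction of a noisy $y_\star$ would add $\sigma_n^2$.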

Projections of Inputs into Feature Space

Introduce the function $\phi(x)$, which maps a $D$-dimensional input vector $x$ into an $N$-dimensional feature space. The model becomes

\[f(x) = \phi(x)^Tw\]

Define $k(x,x')=\phi(x)^T\Sigma_p\phi(x')$ as a covariance function or kernel.

kernel trick: if an algorithm is defined solely in terms of inner products in input space, then it can be lifted into feature space by replacing occurrences of those inner products by $k(x,x')$.
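As a small check of the idea, the sketch below uses a hypothetical feature map $\phi(x) = (1, x, x^2)$ for scalar $x$ and an assumed diagonal $\Sigma_p$. It compares the kernel computed through explicit features with the same kernel written directly in terms of the product $x x'$, which never forms $\phi$.

```python
import numpy as np

def phi(x):
    """Hypothetical feature map for a scalar input: (1, x, x^2)."""
    return np.array([1.0, x, x**2])

# Assumed prior covariance on the three weights (illustrative values).
Sigma_p = np.diag([1.0, 0.5, 0.25])

def k(x, x_prime):
    """Covariance function k(x, x') = phi(x)^T Sigma_p phi(x')."""
    return phi(x) @ Sigma_p @ phi(x_prime)

def k_closed_form(x, x_prime):
    """The same kernel evaluated without building the feature vectors."""
    s = x * x_prime
    return 1.0 + 0.5 * s + 0.25 * s**2

x, x_prime = 0.7, -1.3
print(k(x, x_prime), k_closed_form(x, x_prime))
```

With only three features the saving is negligible; the point is that an algorithm needing only $k(x,x')$ works unchanged when $N$ is large or infinite.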

Function-space View

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

A GP is completely specified by its mean function and covariance function,

\[f(x) \sim GP(m(x), k(x, x'))\]

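marginalization property: if the GP specifies $(y_1,y_2)\sim N(\mu,\Sigma)$, then it must also specify $y_1\sim N(\mu_1,\Sigma_{11})$, where $\mu_1$ and $\Sigma_{11}$ are the corresponding subvector of $\mu$ and submatrix of $\Sigma$.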

squared exponential (SE) covariance function:

\[\cov(f(x_p), f(x_q)) = k(x_p,x_q) = \exp(-\frac 12\vert x_p-x_q\vert^2)\]

The specification of the covariance function implies a distribution over functions. We can draw samples from the distribution of functions evaluated at any number of points.
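A minimal sketch of drawing such samples, assuming a zero mean function and 1-D inputs: build the covariance matrix $K$ at the chosen points, take its Cholesky factor $L$, and map standard normal draws $u$ to $f = Lu$. The helper names and the small diagonal jitter (added only for numerical stability) are assumptions of this sketch.

```python
import numpy as np

def se_kernel(xp, xq):
    """k(x_p, x_q) = exp(-0.5 * |x_p - x_q|^2) for 1-D input arrays."""
    diff = xp[:, None] - xq[None, :]
    return np.exp(-0.5 * diff**2)

def sample_gp_prior(x, n_samples=3, jitter=1e-6, seed=0):
    """Draw n_samples functions f ~ N(0, K(x, x)) evaluated at the points x."""
    K = se_kernel(x, x) + jitter * np.eye(len(x))  # jitter keeps K positive definite
    L = np.linalg.cholesky(K)                      # K = L L^T
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((len(x), n_samples))
    return (L @ u).T                               # each row is one sampled function

x = np.linspace(-5, 5, 50)
samples = sample_gp_prior(x)
print(samples.shape)
```

Each row of `samples` can be plotted against `x` to visualize one draw from the prior.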

Varying the Hyperparameters

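The SE covariance function in one dimension, including the noise on the targets, has the following form, with the length-scale $\ell$, the signal variance $\sigma_f^2$, and the noise variance $\sigma_n^2$ as hyperparameters: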

\[k_y(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{1}{2\ell^2}(x_p-x_q)^2\right) + \sigma^2_n\delta_{pq}\]
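To see what the length-scale does, the sketch below builds this covariance matrix for two inputs one unit apart and prints how strongly they are correlated for a few values of $\ell$. The function name `k_y` mirrors the formula; the default hyperparameter values are arbitrary assumptions.

```python
import numpy as np

def k_y(x, sigma_f=1.0, ell=1.0, sigma_n=0.1):
    """Covariance matrix of the noisy targets y at 1-D inputs x."""
    diff = x[:, None] - x[None, :]
    K_f = sigma_f**2 * np.exp(-0.5 * diff**2 / ell**2)
    # delta_pq is a Kronecker delta on the indices p, q,
    # so the noise variance only enters on the diagonal.
    return K_f + sigma_n**2 * np.eye(len(x))

x = np.array([0.0, 1.0])
for ell in (0.3, 1.0, 3.0):
    K = k_y(x, ell=ell)
    # Ratio of off-diagonal to diagonal entry: correlation between y_1 and y_2.
    print(ell, K[0, 1] / K[0, 0])
```

A short length-scale makes nearby targets almost independent, so sampled functions vary quickly; a long one makes them vary slowly.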

Other kernel functions:

The Matérn Class of Covariance Functions

  • as the smoothness parameter $\nu \rightarrow\infty$, it becomes the SE kernel
  • it becomes simple when $\nu$ is a half-integer: $\nu = p+1/2$. In this case, the covariance function is a product of an exponential and a polynomial of order $p$
  • in the absence of explicit prior knowledge about the existence of higher-order derivatives, it is probably very hard to distinguish between values of $\nu \ge 7/2$ from finite noisy training examples (or even to distinguish between finite values of $\nu$ and $\nu\rightarrow\infty$)
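The note does not write out the half-integer formula, so the sketch below assumes the form given in the book for $\nu = p + 1/2$, with $r = \vert x_p - x_q\vert$:

\[k_{\nu=p+1/2}(r) = \exp\left(-\frac{\sqrt{2\nu}\,r}{\ell}\right)\frac{p!}{(2p)!}\sum_{i=0}^{p}\frac{(p+i)!}{i!\,(p-i)!}\left(\frac{\sqrt{8\nu}\,r}{\ell}\right)^{p-i}\]

It evaluates this for a few values of $p$ and measures the largest gap to the SE kernel, to illustrate the first and third bullets numerically. The function names are illustrative.

```python
import math
import numpy as np

def matern_half_integer(r, p, ell=1.0):
    """Matern covariance for nu = p + 1/2, with p a non-negative integer."""
    nu = p + 0.5
    s = np.asarray(r, dtype=float) / ell
    poly = np.zeros_like(s)
    for i in range(p + 1):
        coef = math.factorial(p + i) / (math.factorial(i) * math.factorial(p - i))
        poly += coef * (math.sqrt(8 * nu) * s) ** (p - i)
    scale = math.factorial(p) / math.factorial(2 * p)
    # Exponential factor times a polynomial of order p in r.
    return np.exp(-math.sqrt(2 * nu) * s) * scale * poly

def se(r, ell=1.0):
    """Squared exponential covariance as a function of the distance r."""
    return np.exp(-0.5 * (np.asarray(r, dtype=float) / ell) ** 2)

r = np.linspace(0.0, 3.0, 31)
# Largest absolute difference to the SE kernel on the grid, per value of p.
gaps = {p: float(np.max(np.abs(matern_half_integer(r, p) - se(r)))) for p in (0, 1, 2, 20)}
print(gaps)
```

For $p = 0$ the formula reduces to the exponential kernel $\exp(-r/\ell)$, and the gap to the SE kernel shrinks as $p$ grows.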

Rational Quadratic Covariance Function

It can be seen as a scale mixture (an infinite sum) of squared exponential (SE) covariance functions with different characteristic length-scales.

As $\alpha\rightarrow\infty$, the RQ covariance tends to the SE covariance function.
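The RQ formula is not written out in this note; the sketch below assumes the book's form $k_{RQ}(r) = \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha}$ and checks the limit numerically by measuring the largest gap to the SE kernel for increasing $\alpha$. Function names and the grid of distances are illustrative.

```python
import numpy as np

def rq_kernel(r, alpha, ell=1.0):
    """Rational quadratic covariance (1 + r^2 / (2 alpha ell^2))^(-alpha)."""
    return (1.0 + r**2 / (2.0 * alpha * ell**2)) ** (-alpha)

def se_kernel_r(r, ell=1.0):
    """Squared exponential covariance as a function of the distance r."""
    return np.exp(-0.5 * r**2 / ell**2)

r = np.linspace(0.0, 3.0, 31)
# Largest absolute difference between RQ and SE on the grid, per alpha.
rq_gaps = {a: float(np.max(np.abs(rq_kernel(r, a) - se_kernel_r(r)))) for a in (1.0, 10.0, 1000.0)}
print(rq_gaps)
```

Small $\alpha$ gives heavier tails than the SE kernel, which is what mixing over many length-scales would suggest.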
