WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Bayesian Conjugate Gradient Method

Tags: Conjugate Gradient, Krylov Space

This note is for Cockayne, J., Oates, C. J., Ipsen, I. C. F., & Girolami, M. (2018). A Bayesian Conjugate Gradient Method. Bayesian Analysis.

Linear Solvers

For the solution of linear systems,

\begin{equation} Ax^*=b\label{1} \end{equation}

where $A\in\IR^{d\times d}$ is an invertible matrix and $b\in \IR^d$ is a vector, while $x^*\in\IR^d$ is to be determined.

  • iterative methods: Krylov subspace methods (among the most successful at obtaining an approximate solution at low cost)
  • direct methods: LU or Cholesky decomposition

The conjugate gradient (CG) method is a popular iterative method and perhaps the first instance of a Krylov subspace method, but its convergence slows when the system is poorly conditioned. In that case one can instead solve an equivalent preconditioned system (a small numerical illustration follows below):

  • left-preconditioning: $P^{-1}Ax^*=P^{-1}b$
  • right-preconditioning: $AP^{-1}Px^*=b$

where $P$ is chosen both so that

  1. $P^{-1}A$ (or $AP^{-1}$) has a lower condition number than $A$ itself
  2. computing the solution of systems $Py=c$ is computationally inexpensive for arbitrary $y$ and $c$
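
As a rough numerical illustration of point 1 (my own sketch, using a simple Jacobi/diagonal preconditioner, which the paper does not prescribe):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
M = rng.standard_normal((d, d))
M = M @ M.T + d * np.eye(d)              # reasonably well-conditioned SPD matrix
s = np.sqrt(np.logspace(0, 6, d))        # widely varying scales
A = s[:, None] * M * s[None, :]          # badly scaled SPD system matrix
b = rng.standard_normal(d)

P = np.diag(np.diag(A))                  # Jacobi preconditioner: solving Py = c is cheap
A_prec = np.linalg.solve(P, A)           # P^{-1} A  (left preconditioning)
b_prec = np.linalg.solve(P, b)           # P^{-1} b, same solution x*

def eig_ratio(B):
    # ratio of extreme eigenvalues; for SPD systems this is the condition
    # number appearing in the classical CG convergence bound
    ev = np.linalg.eigvals(B).real
    return ev.max() / ev.min()

print(f"eigenvalue ratio of A      : {eig_ratio(A):.2e}")
print(f"eigenvalue ratio of P^-1 A : {eig_ratio(A_prec):.2e}")   # much smaller here
```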

In situations where numerical error cannot practically be made negligible, an estimate for the error $x_m-x^*$ must accompany the output $x_m$ of any linear solver. In practice, the algorithm is usually terminated when this error reaches the level of machine precision, which can require a very large number of iterations and substantial computational effort. This often constitutes the principal bottleneck in contemporary applications.

The contribution of the paper is to demonstrate how Bayesian analysis can be used to develop a richer, probabilistic description for the error in estimating the solution $x^*$ with an iterative method.

Probabilistic Numerical Methods

Bayesian probabilistic numerical methods posit a prior distribution for the unknown (here $x^*$), and condition on a finite amount of information about $x^*$ to obtain a posterior that reflects the level of uncertainty in $x^*$, given the finite information obtained.

Recent work: Hennig (2015) treated the problem as an inference problem for the matrix $A^{-1}$, and established correspondence with existing iterative methods by selection of different matrix-valued Gaussian priors within a Bayesian framework.

In contrast, the paper places a prior on the solution $x^*$ rather than the matrix $A^{-1}$.

Probabilistic Linear Solver

The Bayesian method is defined by the choice of prior and the information on which the prior is to be conditioned.

  • the information about $x^*$ is linear and is provided by search directions $s_i$, $i=1,\ldots,m \ll d$, through the matrix-vector products
\[y_i:=(s_i^TA)x^* = s_i^Tb\]

The products on the right-hand side are assumed to be computed without error, which implies a likelihood model in the form of a Dirac distribution:

\[p(y\mid x) = \delta(y-S_m^TAx)\,,\]

where $S_m$ denotes the matrix whose columns are $s_1,\ldots,s_m$.

Linear information is well-adapted to inference with stable distributions. (Let $X_1$ and $X_2$ be independent copies of a random variable $X$. Then $X$ is said to be stable if, for any constants $\alpha,\beta>0$, the random variable $\alpha X_1+\beta X_2$ has the same distribution as $\gamma X+\delta$ for some constants $\gamma > 0$ and $\delta$.) Let $x$ be a random variable that models the epistemic uncertainty about the true solution $x^*$, and endow $x$ with the prior distribution

\[p(x) = \calN(x;x_0,\Sigma_0)\,,\]

where $x_0$ and $\Sigma_0$ are each assumed to be known a priori.

Now there exists a unique Bayesian probabilistic numerical method which outputs the conditional distribution $p(x\mid y_m)$, where $y_m=[y_1,\ldots,y_m]^T$ satisfies $y_m=S_m^TAx^*=S_m^Tb$.
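
Concretely, since the prior is Gaussian and the information is linear and noiseless, the posterior follows from standard Gaussian conditioning. Below is a minimal numpy sketch of that conditioning step (the variable and function names are mine, not from the paper's code):

```python
import numpy as np

def posterior(A, b, x0, Sigma0, S):
    """Condition the Gaussian prior N(x0, Sigma0) on the noiseless linear
    observations y_m = S^T A x* = S^T b (standard Gaussian conditioning)."""
    H = S.T @ A                                # observation operator: y_m = H x*
    Lam = H @ Sigma0 @ H.T                     # S^T A Sigma0 A^T S  (m x m)
    gain = Sigma0 @ H.T @ np.linalg.inv(Lam)
    mean = x0 + gain @ (S.T @ b - H @ x0)      # posterior mean x_m
    cov = Sigma0 - gain @ H @ Sigma0           # posterior covariance
    return mean, cov

# usage with arbitrary (linearly independent) search directions
rng = np.random.default_rng(0)
d, m = 10, 3
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)                    # symmetric positive definite
x_star = rng.standard_normal(d)
b = A @ x_star
S = rng.standard_normal((d, m))                # columns s_1, ..., s_m
mean, cov = posterior(A, b, np.zeros(d), np.eye(d), S)
assert np.allclose(S.T @ A @ mean, S.T @ b)    # the observed information is reproduced exactly
```

The posterior mean reproduces the observed projections $S_m^Tb$ exactly, while the posterior covariance quantifies the uncertainty remaining in the directions not yet explored.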

Conjugate Gradient Method

The CG method can be viewed as solving the minimization problem

\[x_m = \arg\min_{x\in\calK_m}\Vert x-x^*\Vert_A\,,\]

where $\Vert z\Vert_A := \sqrt{z^TAz}$ and $\calK_m$ is an $m$-dimensional linear subspace of $\IR^d$ (one of a sequence of such subspaces), or equivalently

\[x_m = \arg\min_{x\in\calK_m}f(x)\,,\]

where $f(x)=\frac 12x^TAx-x^Tb$.

Let $S_m\in \IR^{d\times m}$ denote a matrix whose columns are arbitrary linearly independent search directions $s_1,\ldots,s_m$, with $\range(S_m)=\calK_m$. Let $x_0$ denote an arbitrary starting point for the algorithm. Then $x_m =x_0+S_mc$ for some $c\in\IR^m$, which can be computed by solving $\nabla_c f(x_0+S_mc)=0$, the gradient being taken with respect to $c$.
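
Writing this condition out (since $\nabla f(x)=Ax-b$) gives an explicit expression for $c$, and hence for $x_m$:

\[S_m^T\big(A(x_0+S_mc)-b\big)=0\quad\Longrightarrow\quad c=(S_m^TAS_m)^{-1}S_m^T(b-Ax_0)\,,\]

so that $x_m = x_0 + S_m(S_m^TAS_m)^{-1}S_m^T(b-Ax_0)$.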

In CG, the search directions are chosen to be $A$-conjugate and normalized so that $S_m^TAS_m=I$; the expression then simplifies to

\[x_m = x_0 + S_mS_m^T(b-Ax_0)\,,\]

which lends itself to an iterative numerical method

\[x_m = x_{m-1} + s_ms_m^T(b-Ax_{m-1})\,.\]
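
For reference, here is a minimal numpy sketch of standard CG in its usual residual-based form (equivalent to the update above when the directions are $A$-normalized; this is plain CG, not the Bayesian variant developed in the paper):

```python
import numpy as np

def cg(A, b, x0, m):
    """Run m iterations of standard CG for symmetric positive definite A."""
    x = x0.copy()
    r = b - A @ x                       # initial residual
    s = r.copy()                        # first search direction
    for _ in range(m):
        As = A @ s
        alpha = (r @ r) / (s @ As)      # step length along s
        x = x + alpha * s
        r_new = r - alpha * As
        if np.linalg.norm(r_new) < 1e-12:
            break                       # converged early
        beta = (r_new @ r_new) / (r @ r)
        s = r_new + beta * s            # next A-conjugate direction
        r = r_new
    return x
```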

Search Directions

Two choices:

  • Optimal Information
  • Conjugacy: the name Bayesian Conjugate Gradient Method is used for the same reason as in CG, since the search directions are chosen to be the direction of gradient descent subject to a conjugacy requirement, albeit a different one than in standard CG

Krylov Subspace Method

The Krylov subspace $K_m(M, v)$, for $M\in \IR^{d\times d}$ and $v\in \IR^d$, is defined as $$ K_m(M, v) := \span(v, Mv, M^2v, \ldots, M^mv)\,. $$ For a vector $w\in\IR^d$, the shifted Krylov subspace is the affine space $$ w+K_m(M, v) := \{w + z : z \in K_m(M, v)\}\,. $$

It is well-known that CG is a Krylov subspace method for symmetric positive-definite matrices $A$, meaning that

\[x_m = \argmin_{x\in x_0+\calK_{m-1}(A,r_0)}\Vert x-x^*\Vert_A\,.\]
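
A small numpy sketch of this characterization (my own illustration): build a basis of $\calK_{m-1}(A,r_0)$ and take the $A$-norm minimizer over the shifted subspace by solving the projected system, exactly as in the derivation above.

```python
import numpy as np

def krylov_basis(M, v, m):
    """Columns spanning K_m(M, v) = span(v, Mv, ..., M^m v)."""
    cols = [v]
    for _ in range(m):
        cols.append(M @ cols[-1])
    return np.column_stack(cols)

def a_norm_minimizer(A, b, x0, K):
    """Minimizer of ||x - x*||_A over x = x0 + K c, obtained from the
    projected system (K^T A K) c = K^T (b - A x0)."""
    c = np.linalg.solve(K.T @ A @ K, K.T @ (b - A @ x0))
    return x0 + K @ c

# the m-th CG iterate coincides with the A-norm minimizer over x0 + K_{m-1}(A, r0)
rng = np.random.default_rng(1)
d, m = 8, 3
B = rng.standard_normal((d, d))
A = B @ B.T + d * np.eye(d)                 # symmetric positive definite
b = rng.standard_normal(d)
x0 = np.zeros(d)
K = krylov_basis(A, b - A @ x0, m - 1)      # basis of K_{m-1}(A, r0)
x_m = a_norm_minimizer(A, b, x0, K)
```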

Prior Choice

Covariance Structure

  • Natural Prior
  • Preconditioner Prior
  • Krylov Subspace Prior

Covariance Scale

Treat the scale of the prior covariance as an additional parameter to be learned: write the covariance as $\nu\Sigma_0$ and consider $\nu$ within a hierarchical Bayesian framework.
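
A sketch of this standard hierarchical construction (the particular reference prior shown here is illustrative and should be checked against the paper) is

\[x\mid \nu \sim \calN(x_0,\nu\Sigma_0)\,,\qquad p(\nu)\propto \nu^{-1}\,,\]

after which marginalizing $\nu$ out of the posterior replaces the Gaussian posterior on $x$ with a heavier-tailed, multivariate $t$-type distribution.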

Code

Check github.com/jcockayne/bcg


Published in categories Note