# Bayesian Conjugate Gradient Method

This note is for Cockayne, J., Oates, C. J., Ipsen, I. C. F., & Girolami, M. (2018). A Bayesian Conjugate Gradient Method. Bayesian Analysis.

## Linear Solvers

For the solution of linear systems,

\begin{equation} Ax^*=b\label{1} \end{equation}

where $A\in\IR^{d\times d}$ is an invertible matrix, $b\in \IR^d$ is a given vector, and $x^*\in\IR^d$ is to be determined.

- iterative methods: Krylov subspace methods (among the most successful at obtaining an approximate solution at low cost)
- direct methods: LU or Cholesky decomposition

The conjugate gradient (CG) method is a popular iterative method and perhaps the first instance of a Krylov subspace method, but the convergence is slowed when the system is poorly conditioned. Then we can consider solving equivalent **preconditioned** systems, by solving

- left-preconditioning: $P^{-1}Ax^*=P^{-1}b$
- right-preconditioning: $AP^{-1}Px^*=b$

where $P$ is chosen both so that

- $P^{-1}A$ (or $AP^{-1}$) has a lower condition number than $A$ itself
- computing the solution of systems $Py=c$ is computationally inexpensive for arbitrary $y$ and $c$
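As a concrete illustration of both criteria, consider a Jacobi (diagonal) preconditioner $P=\mathrm{diag}(A)$ for a badly scaled system. The sketch below is illustrative and not from the paper; the test matrix is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 50
# SPD test matrix: small off-diagonal coupling plus a diagonal
# spanning four orders of magnitude (a badly scaled system).
G = 0.01 * rng.standard_normal((d, d))
A = G @ G.T + np.diag(10.0 ** rng.uniform(0, 4, size=d))
P_inv = np.diag(1.0 / np.diag(A))     # Jacobi preconditioner, P = diag(A)

cond_A = np.linalg.cond(A)
cond_PA = np.linalg.cond(P_inv @ A)   # left-preconditioned system
assert cond_PA < cond_A               # conditioning improves
```

Here the second criterion is trivially satisfied: solving $Py=c$ for a diagonal $P$ costs only $d$ divisions.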

In situations where numerical error cannot practically be made negligible, an estimate of the error $x_m-x^*$ must accompany the output $x_m$ of any linear solver. In practice, the algorithm is usually terminated when this error estimate (typically the residual norm) reaches machine precision, which can require a very large number of iterations and substantial computational effort. This often constitutes the principal bottleneck in contemporary applications.

**The contribution** of the paper is to demonstrate how Bayesian analysis can be used to develop a richer, probabilistic description for the error in estimating the solution $x^*$ with an iterative method.

## Probabilistic Numerical Methods

Bayesian probabilistic numerical methods posit a prior distribution for the unknown (here $x^*$), and condition on a finite amount of information about $x^*$ to obtain a posterior that reflects the level of uncertainty in $x^*$, given the finite information obtained.

Recent work: Hennig (2015) treated the problem as an inference problem for the matrix $A^{-1}$, and established correspondence with existing iterative methods by selection of different matrix-valued Gaussian priors within a Bayesian framework.

In contrast, the paper places a prior on the solution $x^*$ rather than the matrix $A^{-1}$.

## Probabilistic Linear Solver

The Bayesian method is defined by the choice of prior and the information on which the prior is to be conditioned.

- the information about $x^*$ is linear and is provided by **search directions** $s_i$, $i=1,\ldots,m\ll d$, through the matrix-vector products

\begin{equation} y_i := s_i^Tb = s_i^TAx^* \end{equation}

The matrix-vector products on the right-hand side are assumed to be computed without error, which implies a likelihood model in the form of a Dirac distribution:

\begin{equation} p(y_m\mid x)=\delta(y_m-S_m^TAx) \end{equation}

where $S_m$ denotes the matrix whose columns are $s_1,\ldots,s_m$.

Linear information is well-adapted to inference with **stable distributions** *(Let $X_1$ and $X_2$ be independent copies of a random variable $X$. Then $X$ is said to be stable if, for any constants $\alpha,\beta>0$, the random variable $\alpha X_1+\beta X_2$ has the same distribution as $\gamma X+\delta$ for some constants $\gamma > 0$ and $\delta$.)* Let $x$ be a random variable, which will be used to model epistemic uncertainty regarding the true solution $x^*$, and endow $x$ with the Gaussian prior distribution

\begin{equation} x\sim\mathcal{N}(x_0,\Sigma_0) \end{equation}

where the mean $x_0$ and the symmetric positive-definite covariance $\Sigma_0$ are each assumed to be known a priori.

Now there exists a unique Bayesian probabilistic numerical method which outputs the conditional distribution $p(x\mid y_m)$, where $y_m=[y_1,\ldots,y_m]^T$ satisfies $y_m=S_m^TAx^*=S_m^Tb$.
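Assuming the Gaussian prior $x\sim\mathcal{N}(x_0,\Sigma_0)$, the information is exact and linear, so the posterior is again Gaussian with closed-form mean and covariance given by standard Gaussian conditioning. A minimal NumPy sketch of this conditioning step (function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def posterior(A, b, S, x0, Sigma0):
    """Condition the Gaussian prior N(x0, Sigma0) on y = S^T A x = S^T b."""
    r0 = b - A @ x0                            # initial residual
    Lam = S.T @ A @ Sigma0 @ A.T @ S           # Gram matrix of the information
    K = Sigma0 @ A.T @ S @ np.linalg.inv(Lam)  # Gaussian "gain" matrix
    x_m = x0 + K @ (S.T @ r0)                  # posterior mean
    Sigma_m = Sigma0 - K @ S.T @ A @ Sigma0    # posterior covariance
    return x_m, Sigma_m

# With m = d linearly independent directions, the posterior mean
# recovers the exact solution and the covariance collapses to zero.
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # symmetric positive definite
x_star = rng.standard_normal(d)
b = A @ x_star
S = rng.standard_normal((d, d))        # d independent directions
x_m, Sigma_m = posterior(A, b, S, np.zeros(d), np.eye(d))
assert np.allclose(x_m, x_star)
assert np.allclose(Sigma_m, 0, atol=1e-8)
```

For $m<d$ the covariance $\Sigma_m$ remains non-zero and quantifies the residual uncertainty about $x^*$.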

## Conjugate Gradient Method

The problem can be seen as

\begin{equation} x_m = \arg\min_{x\in x_0+\calK_m}\|x-x^*\|_A \end{equation}

where $\calK_m$ is a sequence of $m$-dimensional linear subspaces of $\IR^d$ and $\|x\|_A=\sqrt{x^TAx}$, or equivalently

\begin{equation} x_m = \arg\min_{x\in x_0+\calK_m}f(x) \end{equation}

where $f(x)=\frac 12x^TAx-x^Tb$.

Let $S_m\in \IR^{d\times m}$ denote a matrix whose columns are arbitrary linearly independent search directions $s_1,\ldots,s_m$, with $\range(S_m)=\calK_m$. Let $x_0$ denote an arbitrary starting point for the algorithm. Then $x_m =x_0+S_mc$ for some $c\in\IR^m$ which can be computed by solving $\nabla f(x_0+S_mc)=0$.

In CG, the search directions are chosen to be $A$-conjugate, i.e. $s_i^TAs_j=0$ for $i\neq j$, so that $S_m^TAS_m$ is diagonal and

\begin{equation} x_m = x_0 + \sum_{i=1}^m s_i\,\frac{s_i^T(b-Ax_0)}{s_i^TAs_i} \end{equation}

which lends itself to an iterative numerical method: each term of the sum can be added as soon as the corresponding search direction becomes available.
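A compact sketch of this iteration in NumPy, building $A$-conjugate directions by conjugating each new residual against the previous direction (textbook CG; variable names are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0, m):
    """m iterations of CG: rank-one updates along A-conjugate directions."""
    x = x0.copy()
    r = b - A @ x                      # residual
    s = r.copy()                       # first search direction
    for _ in range(m):
        As = A @ s
        alpha = (s @ r) / (s @ As)     # step length s^T r / s^T A s
        x = x + alpha * s              # one term of the sum above
        r = r - alpha * As             # update residual cheaply
        beta = -(r @ As) / (s @ As)    # enforce A-conjugacy with previous s
        s = r + beta * s
    return x

# In exact arithmetic CG converges in at most d iterations.
rng = np.random.default_rng(1)
d = 8
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)            # symmetric positive definite
x_star = rng.standard_normal(d)
x_d = conjugate_gradient(A, A @ x_star, np.zeros(d), d)
assert np.allclose(x_d, x_star)
```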

## Search Directions

Two choices:

- Optimal Information
- Conjugacy: Bayesian Conjugate Gradient Method
*the name is for the same reason as in CG, as search directions are chosen to be the direction of gradient descent subject to a conjugacy requirement, albeit a different one than in standard CG*

## Krylov Subspace Method

It is well-known that CG is a Krylov subspace method for symmetric positive-definite matrices $A$, meaning that

\begin{equation} x_m \in x_0 + \calK_m(A,r_0), \qquad \calK_m(A,r_0) := \mathrm{span}\{r_0, Ar_0, \ldots, A^{m-1}r_0\} \end{equation}

where $r_0=b-Ax_0$ denotes the initial residual.
## Prior Choice

### Covariance Structure

- Natural Prior
- Preconditioner Prior
- Krylov Subspace Prior

### Covariance Scale

Treat the prior scale as an additional parameter to be learned: replace the prior covariance $\Sigma_0$ with $\nu\Sigma_0$ and consider the scale $\nu$ in a hierarchical Bayesian framework.

## Code

The code accompanying the paper is available at github.com/jcockayne/bcg.