WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Conjugate Gradient for Regression

Posted on 0 Comments
Tags: Conjugate Gradient, Linear Regression

The conjugate gradient method is an iterative method for solving a linear system of equations, so we can use conjugate method to estimate the parameters in (linear/ridge) regression.


Considering a linear system of equations


where $A$ is a $n\times n$ symmetric positive definite matrix. It can be stated equivalently as the following minimization problem

\[\min\;\phi(x) =\frac{1}{2}x^TAx-b^Tx\]

The gradient of $\phi$ equals the residual of the linear system,

\[\nabla \phi=Ax-b=r(x)\]

so in particular at $x=x_k$ we have


Conjugate Direction Method

A set of nonzero vectors ${p_0,p_1,\ldots,p_l}$ is said to be conjugate with respect to the symmetric definite matrix $A$ if

\[p_i^TAp_j=0\;\text{for all }i\neq j\]

Given a starting point $x_0\in \IR^n$ and a set of conjugate directions ${p_0,p_1,\ldots,p_{n-1}}$, let us generate the sequence ${x_k}$ by


where $\alpha_k$ is the one dimensional minimizer of the quadratic function $\phi(\cdot)$ along $x_k+\alpha p_k$, given explicitly by


Consider the function $\phi(\alpha)$ with respect step length $\alpha$

\(\phi(\alpha)=f(x_k+\alpha p_k)\) where $f(x) = \frac 12x^TAx-b^Tx$. Let $\phi’(\alpha)=0$, we have \(\alpha_k=-\frac{\nabla f_k^Tp_k}{p_k^TAp_k}\)

For any $x_0\in \IR^n$, the sequence $x_k$ by the conjugate direction algorithm converges to the solution $x^* $ of the linear system in at most $n$ steps.
Let $x_0\in \IR^n$ be any starting point and suppose that the sequence $\{x_k\}$ is generated by the conjugate direction algorithm. Then $$ r_k^Tp_i=0\;\text{for }i=0,1\ldots,k-1 $$

Conjugate Gradient Method

The conjugate gradient method is a conjugate direction method with a very special property: In generating its set of conjugate vectors, it can compute a new vector $p_k$ by using only the previous vector $p_{k-1}$

Each direction $p_k$ is chosen to be a linear combination of $-r_k$ and the previous direction $p_{k-1}$, that is


where the scalar $\beta_k$ is to be determined by the requirement that $p_{k-1}$ and $p_k$ must be conjugate with respect to $A$. By premultiplying by $p_{k-1}^TA$ and imposing the condition $p_{k-1}^TAp_k=0$, we find that


Figure 1: CG-Preliminary Version

Suppose that the $k$-th iterate generated by the conjugate gradient method is not the solution point $x^* $. The following properties hold: $$ \begin{array}{ll} r_k^Tr_i&=0,\;for \;i=0,1,\ldots,k-1\\ p_k^TAp_i&=0\; for \; i=0,1,\ldots,k-1 \end{array} $$

We can simplify $\alpha_k$ by using the property $r_k^Tp_{k-1}=0$ and obtain


Also, we can simplify $\beta_k$ by using the property $r_{k+1}^Tr_k=0$ and obtain


Then we obtain the following standard form of the conjugate gradient method

Figure 2: CG-Standard Version

Application in Regression

For linear regression, we can recognize $X’X$ as $A$ and $X’Y$ as $b$. It is cautious when there is multiple collinearity in $X$. Because $A$ should be a symmetric positive definite matrix, while $X’X$ doesn’t satisfy this requirement if there is multiple collinearity in $X$. Fortunately, for ridge regression, $X’X+\lambda I,\lambda>0$ is a symmetric positive definite matrix, so we don’t worry about multiple collinearity.

The following R code is for estimating the parameters in regression by using conjugate gradient method.


Generate the dataset by random, and apply CG method to estimate the parameter of linear regression.

It is obvious that the differences of these two methods so small.

As for ridge regression, we can also test the performance by using the following R code.


  1. 邱怡轩, 共轭梯度法计算回归
  2. Jorge Nocedal, Stephen J. Wright, Numerical Optimization (Second Edition)

Published in categories Report