WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Bayesian Sparse Multiple Regression

Tags: Cross-Validation, Ridge, Lasso, High-Dimensional

This note is for Chakraborty, A., Bhattacharya, A., & Mallick, B. K. (2020). Bayesian sparse multiple regression for simultaneous rank reduction and variable selection. Biometrika, 107(1), 205–221.

In the context of multiple response regression, a popular technique to achieve parsimony and interpretability is to consider a reduced-rank decomposition of the coefficient matrix, commonly known as reduced rank regression.

Let $X\in \IR^{n\times p}$ and $Y\in \IR^{n\times q}$, and consider the multivariate linear regression model

\[Y = XC + E\,, \quad E=(e_1,\ldots, e_n)^T\]

where

  • the response has been centred
  • no intercept term
  • the rows of the error matrix are independent, with $e_i\sim N(0, \Sigma)$
  • the high-dimensional case, $p > \max(n, q)$
  • the dimension of the response $q$ is modest relative to the sample size

The basic assumption in reduced rank regression is

\[\rank(C) = r \le \min(p, q)\]

where

\[C = B_\star A_\star^T\,, B_\star\in \IR^{p\times r}, A_\star\in \IR^{q\times r}\]
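To make the setup concrete, here is a minimal simulation sketch in Python/NumPy; the dimensions $n, p, q, r$ and the error variances are illustrative placeholders, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# illustrative dimensions: high-dimensional in p, modest q, low rank r
n, p, q, r = 100, 500, 10, 3

# rank-r coefficient matrix C = B_* A_*^T
B_star = rng.normal(size=(p, r))
A_star = rng.normal(size=(q, r))
C = B_star @ A_star.T                    # p x q, rank(C) = r

# design matrix and diagonal error covariance Sigma
X = rng.normal(size=(n, p))
sigma2 = rng.uniform(0.5, 1.5, size=q)   # diag(Sigma), arbitrary choice
E = rng.normal(size=(n, q)) * np.sqrt(sigma2)

# multivariate linear regression Y = XC + E (X and E are mean zero,
# so the response is already centred and no intercept is needed)
Y = X @ C + E

print(np.linalg.matrix_rank(C))          # 3
```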

It is possible to treat $r$ as a parameter and assign it a prior distribution inside a hierarchical formulation; however, posterior inference on $r$ then requires the calculation of intractable marginal likelihoods or resorting to reversible-jump MCMC.

To avoid specifying a prior on $r$, the paper works within a parameter-expanded framework to

  • consider a potentially full-rank decomposition $C=BA^T$ with $B\in \IR^{p\times q}, A\in \IR^{q\times q}$,
  • assign shrinkage priors to $A$ and $B$ to shrink out the redundant columns when $C$ is indeed low rank.

Consider independent standard normal priors on the entries of $A$, i.e., $a_{hk}\sim N(0, 1)$ independently for $h, k=1,\ldots,q$, and use $\Pi_A$ to denote this prior on $A$. Alternatively, a uniform prior on the Stiefel manifold of orthogonal matrices can be used, but it leads to slower computation.

Use independent horseshoe priors on the columns of $B$, denoted by $\Pi_B$; stronger shrinkage is warranted on the columns of $B$:

\[b_{jh}\mid \lambda_{jh}, \tau_h \sim N(0, \lambda_{jh}^2\tau_h^2),\quad \lambda_{jh}\sim Ca_+(0, 1), \quad \tau_h\sim Ca_+(0, 1)\]

independently for $j=1,\ldots,p$ and $h=1,\ldots,q$, where $Ca_+(0,1)$ denotes the standard half-Cauchy distribution, i.e., the standard Cauchy truncated to $(0,\infty)$, with density proportional to $(1+t^2)^{-1}1_{(0,\infty)}(t)$.
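A minimal sketch of a single joint draw from $\Pi_A$ and $\Pi_B$, and of the induced coefficient matrix $C = BA^T$, is given below; the dimensions are placeholders, and half-Cauchy variates are generated as absolute values of standard Cauchy draws.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 500, 10                                   # illustrative dimensions

# Pi_A: independent standard normal entries, A in R^{q x q}
A = rng.normal(size=(q, q))

# Pi_B: column-wise horseshoe prior on B in R^{p x q}
tau = np.abs(rng.standard_cauchy(size=q))        # global scale tau_h per column
lam = np.abs(rng.standard_cauchy(size=(p, q)))   # local scales lambda_{jh}
B = rng.normal(size=(p, q)) * lam * tau          # b_{jh} ~ N(0, lambda_{jh}^2 tau_h^2)

# parameter-expanded coefficient matrix: columns of B that are shrunk
# towards zero effectively reduce the rank of C
C = B @ A.T
```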

The paper primarily restricts attention to settings where $\Sigma$ is diagonal, $\Sigma=\diag(\sigma_1^2,\ldots,\sigma_q^2)$, and assigns independent improper priors $\pi(\sigma_h^2)\propto \sigma_h^{-2}$ $(h=1,\ldots,q)$ to the diagonal elements; denote this prior by $\Pi_\Sigma$.

Then the model becomes

\[Y = XBA^T + E\,, \quad e_i\sim N(0, \Sigma)\]

where

\[B\sim \Pi_B\,,\quad A\sim \Pi_A\,,\quad \Sigma\sim \Pi_\Sigma\]

The likelihood of $(C,\Sigma)$ is

\[p^{(n)}(Y\mid C, \Sigma; X) \propto \vert \Sigma\vert^{-n/2}\exp(-\trace\{(Y-XC)\Sigma^{-1}(Y-XC)^T\}/2)\]
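For diagonal $\Sigma$ the trace reduces to a weighted sum of squared residuals, so the log-likelihood is easy to evaluate; a minimal sketch (up to the additive constant not depending on $C$ or $\Sigma$):

```python
import numpy as np

def log_likelihood(Y, X, C, sigma2):
    """Log-likelihood of (C, Sigma) up to an additive constant,
    with Sigma = diag(sigma2) as in the diagonal setting above."""
    n = Y.shape[0]
    R = Y - X @ C                        # residual matrix Y - XC
    # -n/2 * log|Sigma|  -  1/2 * tr{(Y - XC) Sigma^{-1} (Y - XC)^T}
    return -0.5 * n * np.sum(np.log(sigma2)) - 0.5 * np.sum(R**2 / sigma2)
```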

Published in categories Note