Bayesian Sparse Multiple Regression
In the context of multiple response regression, a popular technique to achieve parsimony and interpretability is to consider a reduced-rank decomposition of the coefficient matrix, commonly known as reduced rank regression.
Let $X\in \IR^{n\times p}$ and $Y\in \IR^{n\times q}$, and consider the multivariate linear regression model
\[Y = XC + E\,, \quad E=(e_1,\ldots, e_n)^T\]where $C\in \IR^{p\times q}$ is the coefficient matrix, under the following assumptions:
- the response has been centred, so the model has no intercept term;
- the rows of the error matrix are independent, with $e_i\sim N(0, \Sigma)$;
- the setting is high-dimensional, $p > \max(n, q)$;
- the dimension of the response, $q$, is modest relative to the sample size $n$.
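As a minimal sketch of the data-generating model above, the following NumPy snippet simulates $Y = XC + E$ with independent error rows $e_i\sim N(0,\Sigma)$. The function name `simulate_regression`, the dimensions, and the AR(1)-type choice of $\Sigma$ are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def simulate_regression(n=50, p=200, q=5, seed=0):
    """Simulate Y = X C + E with rows of E drawn i.i.d. from N(0, Sigma).

    All sizes and the form of Sigma are hypothetical, chosen only so that
    p > max(n, q) as in the high-dimensional setting.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))   # design matrix
    C = rng.standard_normal((p, q))   # coefficient matrix (full rank here)

    # Illustrative error covariance: Sigma[h, k] = 0.5 ** |h - k|
    idx = np.arange(q)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])

    # If z ~ N(0, I_q) and Sigma = L L^T, then L z ~ N(0, Sigma);
    # applied row-wise this is Z @ L.T.
    L = np.linalg.cholesky(Sigma)
    E = rng.standard_normal((n, q)) @ L.T

    Y = X @ C + E
    return X, Y, C, Sigma


X, Y, C, Sigma = simulate_regression()
print(X.shape, Y.shape, C.shape)
```

With these default sizes, $p = 200$ exceeds $\max(n, q) = 50$, matching the high-dimensional regime described above.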
The basic assumption in reduced rank regression is
\[\rank(C) = r \le \min(p, q)\]so that $C$ admits the decomposition
\[C = B_\star A_\star^T\,, \quad B_\star\in \IR^{p\times r}, \quad A_\star\in \IR^{q\times r}\]
Although it is possible to treat $r$ as a parameter and assign it a prior distribution inside a hierarchical formulation, posterior inference on $r$ then requires either calculating intractable marginal likelihoods or resorting to reversible jump MCMC (RJMCMC).
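To illustrate the rank constraint, the sketch below builds a coefficient matrix $C = B_\star A_\star^T$ from random factors and checks its numerical rank. The sizes `p`, `q`, `r` and the Gaussian factors are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r = 200, 5, 2   # hypothetical sizes with r <= min(p, q)

B_star = rng.standard_normal((p, r))   # p x r factor
A_star = rng.standard_normal((q, r))   # q x r factor
C_lowrank = B_star @ A_star.T          # p x q matrix of rank at most r

# Numerical rank from the singular values of C
rank_C = np.linalg.matrix_rank(C_lowrank)
print("rank of C:", rank_C)

# Free entries in an unconstrained C versus in the two factors
n_full = p * q
n_factored = r * (p + q)
print("entries in C:", n_full, "entries in factors:", n_factored)
```

The factorisation is not unique, since $B_\star A_\star^T = (B_\star R)(A_\star R^{-T})^T$ for any invertible $r\times r$ matrix $R$, so the count of entries in the factors overstates the number of free parameters.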
To avoid specifying a prior on $r$, the paper works within a parameter-expanded framework to
- consider a potentially full-rank decomposition $C=BA^T$ with $B\in \IR^{p\times q}, A\in \IR^{q\times q}$,
- assign shrinkage priors to $A$ and $B$ to shrink out the redundant columns when $C$ is indeed low rank.
Consider independent standard normal priors on the entries of $A$:
- let $\Pi_A$ denote the prior on $A$, i.e., $a_{hk}\sim N(0, 1)$ independently for $h, k=1,\ldots,q$;
- alternatively, a uniform prior on the Stiefel manifold of orthogonal matrices can be used, but it is computationally slow.
Stronger shrinkage is warranted on the columns of $B$, so use independent horseshoe priors on the columns of $B$, and denote the resulting prior by $\Pi_B$:
\[b_{jh}\mid \lambda_{jh}, \tau_h \sim N(0, \lambda_{jh}^2\tau_h^2),\quad \lambda_{jh}\sim Ca_+(0, 1), \quad \tau_h\sim Ca_+(0, 1)\]independently for $j=1,\ldots,p$ and $h=1,\ldots,q$, where $Ca_+(0,1)$ denotes the standard half-Cauchy distribution, i.e., the standard Cauchy truncated to $(0,\infty)$, with density proportional to $(1+t^2)^{-1}1_{(0,\infty)}(t)$.
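The priors $\Pi_A$ and $\Pi_B$ above can be sampled directly, which may help build intuition for how the column-level scales $\tau_h$ act. The following is a minimal sketch, not the paper's sampler; the function name `sample_priors` and the chosen sizes are hypothetical. It uses the fact that the absolute value of a standard Cauchy draw is standard half-Cauchy.

```python
import numpy as np


def sample_priors(p, q, rng):
    """Draw one sample of (A, B) from the priors Pi_A and Pi_B.

    Illustrative helper only: A has i.i.d. N(0, 1) entries, and B has
    horseshoe entries b_jh ~ N(0, lambda_jh^2 * tau_h^2).
    """
    A = rng.standard_normal((q, q))

    # |standard Cauchy| is standard half-Cauchy, i.e. Ca_+(0, 1)
    lam = np.abs(rng.standard_cauchy((p, q)))   # local scales lambda_jh
    tau = np.abs(rng.standard_cauchy(q))        # one global scale per column h

    # tau has shape (q,), so it broadcasts across rows: column h is scaled by tau_h
    B = rng.standard_normal((p, q)) * lam * tau
    return A, B, lam, tau


rng = np.random.default_rng(2)
A, B, lam, tau = sample_priors(p=200, q=5, rng=rng)
C_prior = B @ A.T   # implied prior draw of the coefficient matrix
print(C_prior.shape)
print("column scales tau:", tau)
```

A column $h$ whose $\tau_h$ is close to zero has all of its entries pulled towards zero together, which is the mechanism that lets redundant columns of $B$ be shrunk out.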
We primarily restrict attention to settings where $\Sigma$ is diagonal, $\Sigma=\diag(\sigma_1^2,\ldots,\sigma_q^2)$, and assign independent improper priors $\pi(\sigma_h^2)\propto \sigma_h^{-2}$, $h=1,\ldots,q$, on the diagonal elements; denote this prior on $\Sigma$ by $\Pi_\Sigma$.
Then the model becomes
\[Y = XBA^T + E, \quad e_i\sim N(0, \Sigma)\]with priors
\[B\sim \Pi_B, \quad A\sim \Pi_A, \quad \Sigma\sim \Pi_\Sigma\]The likelihood of $(C,\Sigma)$, where $C = BA^T$, is
\[p^{(n)}(Y\mid C, \Sigma; X) \propto \vert \Sigma\vert^{-n/2}\exp(-\trace\{(Y-XC)\Sigma^{-1}(Y-XC)^T\}/2)\]
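The likelihood above can be evaluated on the log scale, up to an additive constant. The sketch below is one way to do this in NumPy; the function name `log_likelihood` is hypothetical. It avoids forming the $n\times n$ matrix inside the trace by using $\trace\{R\Sigma^{-1}R^T\} = \sum_{i} r_i^T\Sigma^{-1}r_i$, where $r_i^T$ is the $i$-th row of $R = Y - XC$.

```python
import numpy as np


def log_likelihood(Y, X, C, Sigma):
    """Log-likelihood of (C, Sigma) up to an additive constant.

    Computes -(n/2) log|Sigma| - (1/2) tr{(Y - XC) Sigma^{-1} (Y - XC)^T}.
    """
    n = Y.shape[0]
    R = Y - X @ C                                 # n x q residual matrix

    # log|Sigma| via slogdet for numerical stability
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")

    # Sigma is symmetric, so solve(Sigma, R.T).T equals R @ inv(Sigma);
    # summing the elementwise product with R gives the trace term.
    R_Sinv = np.linalg.solve(Sigma, R.T).T
    quad = np.sum(R * R_Sinv)

    return -0.5 * n * logdet - 0.5 * quad


# Small usage example with made-up inputs
rng = np.random.default_rng(3)
X_demo = rng.standard_normal((6, 4))
C_demo = rng.standard_normal((4, 3))
Sigma_demo = np.diag([1.0, 2.0, 3.0])
Y_demo = X_demo @ C_demo + rng.standard_normal((6, 3))
print(log_likelihood(Y_demo, X_demo, C_demo, Sigma_demo))
```

When $\Sigma$ is diagonal, as assumed in most of the post, the solve reduces to dividing column $h$ of the residuals by $\sigma_h^2$.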