Additive Bayesian Variable Selection
This post is based on Rossell, D., & Rubio, F. J. (2019). Additive Bayesian variable selection under censoring and misspecification. arXiv:1907.13563.
The paper studies the effect and interplay of two important issues in Bayesian model selection (BMS):
- the presence of censoring
- model misspecification: assuming the wrong model or functional effect on the response, or not recording truly relevant covariates.
Focus: the additive accelerated failure time (AAFT) model.
BMS is studied under two prior structures:
- local priors
- a prior structure that combines local and non-local priors as a means of enforcing sparsity
BMS asymptotically chooses the model of smallest dimension minimizing Kullback-Leibler divergence to the data-generating truth.
Under mild conditions, this model contains any covariate that has predictive power for either the outcome or the censoring times, and discards the remaining covariates.
The paper characterizes asymptotic Bayes factor rates, which help interpret the roles of censoring and misspecification: both have an asymptotically negligible effect on false positives, but their impact on power is exponential.
Introduction
PH (Proportional hazards) and AFT (Accelerated failure time) can be combined with Bayesian model selection (BMS) to identify relevant variables, enforce sparsity and quantify uncertainty on having selected the right variables and on the form of their effect.
Goal: help better understand the consequences of three important issues for BMS on survival data: misspecification, censoring, and the trade-offs associated with including non-linear effects.
Focus: a possibly misspecified non-linear additive AFT (AAFT) model. This choice is motivated by three considerations: it leads to a simple interpretation of the effects of censoring and misspecification, it allows the results to extend directly to probit binary regression, and in misspecified settings it has been argued to be preferable to the PH model.
Results:
- Under mild assumptions, AFT- and probit-based BMS asymptotically selects variables that improve either the mean squared error for predicting the outcome or the predicted censoring probability; variables that predict neither are discarded by BMS.
- Both censoring and wrongly specified covariate effects have an exponential effect on power, but asymptotically neither leads to false positive inflation.
- The paper develops novel methodology to ameliorate the power drop by learning from the data whether non-linear effects are actually needed, embedded in a likewise novel combination of non-local priors and group-Zellner priors designed to induce group-level sparsity for non-linear effects.
Notation for survival data:
- $o_i\in\bbR^+$: survival times
- $c_i\in\bbR^+$: right-censoring times
- $x_i$: a covariate vector
- $d_i =I(o_i < c_i)$: censoring indicators
- $y_i = \min\{\log(o_i),\log(c_i)\}$: the observed log-times
A standard AAFT model postulates
\[\log(o_i) = \sum_{j=1}^pg_j(x_{ij}) + \epsilon_i\,,\]where $g_j:\bbR\rightarrow\bbR$ belong to a suitable function space and $\epsilon_i$ are independent across $i=1,\ldots,n$ with mean $E(\epsilon_i)=0$ and variance $V(\epsilon_i)=\sigma^2$.
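To make the notation and model concrete, here is a minimal simulation sketch under assumptions of my own (the covariate design, effect functions $g_j$, noise scale, and censoring distribution are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5

# Covariates; only x_1 (linear) and x_2 (non-linear) truly affect survival.
x = rng.normal(size=(n, p))
g = x[:, 0] + np.sin(np.pi * x[:, 1])               # additive signal sum_j g_j(x_ij)
log_o = g + rng.normal(scale=0.5, size=n)           # log survival times, E(eps_i) = 0
log_c = np.log(rng.exponential(scale=5.0, size=n))  # log right-censoring times

y = np.minimum(log_o, log_c)       # observed log-times y_i
d = (log_o < log_c).astype(int)    # censoring indicators d_i = I(o_i < c_i)
print(f"observed censoring rate: {1 - d.mean():.2f}")
```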
Specifically, the paper considers
\[\log(o_i) = x_i^T\beta + s_i^T\delta + \epsilon_i\,,\]where
- $\beta = (\beta_1,\ldots,\beta_p)^T\in\bbR^p$
- $\delta^T=(\delta_1^T,\ldots,\delta_p^T)\in\bbR^{rp}$
- $s_i^T=(s_{i1}^T,\ldots,s_{ip}^T)$ where $s_{ij}\in\bbR^r$ is the projection of $x_{ij}$ onto a cubic spline basis orthogonalized w.r.t. $x_{ij}$ (and the intercept)
Let $X_j$ and $\tilde S_j$ be the matrices whose $i$-th rows are $(1,x_{ij})$ and the cubic spline projection of $x_{ij}$, respectively; then $S_j=(I-X_j(X_j^TX_j)^{-1}X_j^T)\tilde S_j$. As in least-squares regression, where the residual vector is orthogonal to the inputs, this projection makes $S_j$ orthogonal to $X_j$ (see the sketch after this list).
- $\Gamma\subset \bbR^{p(r+1)}\times \bbR^+$: the parameter space
- $(X, S)$: the design matrix with $(x_i^T,s_i^T)$ in its $i$-th row
- $(X_o,S_o)$ and $(X_c,S_c)$: the submatrices containing the rows for uncensored and censored individuals (respectively)
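The orthogonalization above can be sketched as follows. This is an illustration under my own assumptions: a truncated-power cubic spline basis stands in for whatever spline basis the paper actually uses. The key step is computing $S_j=(I-X_j(X_j^TX_j)^{-1}X_j^T)\tilde S_j$ via least-squares residuals:

```python
import numpy as np

def orthogonalized_spline_basis(xj: np.ndarray, r: int = 5) -> np.ndarray:
    """Return S_j: a cubic spline basis for xj, orthogonalized w.r.t. (1, xj).

    Uses a truncated-power basis {xj^2, xj^3, (xj - k)_+^3} as a stand-in
    cubic spline basis; the constant and linear terms are omitted since the
    projection below would annihilate them anyway.
    """
    knots = np.quantile(xj, np.linspace(0.1, 0.9, r - 2))
    S_tilde = np.column_stack(
        [xj**2, xj**3] + [np.clip(xj - k, 0, None) ** 3 for k in knots]
    )
    Xj = np.column_stack([np.ones_like(xj), xj])   # rows (1, x_ij)
    # Residuals of regressing S_tilde on Xj, i.e. S_j = (I - H) S_tilde
    coef, *_ = np.linalg.lstsq(Xj, S_tilde, rcond=None)
    return S_tilde - Xj @ coef

xj = np.random.default_rng(0).normal(size=100)
Sj = orthogonalized_spline_basis(xj)
# Orthogonality check: S_j^T X_j should be numerically zero.
print(np.abs(Sj.T @ np.column_stack([np.ones_like(xj), xj])).max())
```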
Literature:
- parameter estimation for misspecified AFT and PH models: although these models have similar asymptotic properties and which of them is more appropriate depends on the actual dataset at hand, misspecified AFT inference has been argued to be more robust and to better preserve interpretability.
- variable selection:
- the effect of misspecification when using information criteria
- likelihood penalties: LASSO, Ridge, SCAD, and Elastic Net penalties on the Cox and semiparametric AFT models
- nonparametric variable selection via random survival forests.
- Bayesian context:
- shrinkage priors for the Cox and AFT models;
- BMS for the additive hazards model
- Bayesian non-parametric AFT errors and Laplace priors
- structured additive regression (STAR) models coupled with spike-and-slab priors
- non-local priors in the Cox model
Variable selection is formalized as choosing, for each covariate $j$, among three possibilities
\[\gamma_j = \begin{cases} 0, &\text{if }\beta_j=0,\delta_j=0\\ 1, &\text{if }\beta_j\neq 0,\delta_j=0\\ 2, &\text{if }\beta_j\neq 0,\delta_j\neq 0 \end{cases}\]This formulation has two key ingredients:
- it enforces the standard hierarchical desideratum that non-linear terms are only included if the linear term is present (related to the idea of reluctant interaction modelling)
- it considers joint inclusion/exclusion of all columns in $S_j$ associated with the non-linear effect of covariate $j$ (the motivation being that allowing inclusion of individual entries in $\delta_j$ would increase the probability of false positives)
Notation:
- $p_\gamma = \sum_{j=1}^pI(\gamma_j\neq 0)$: the number of active variables
- $s_\gamma = \sum_{j=1}^pI(\gamma_j=2)$: the number that have non-linear effects
- $d_\gamma = p_\gamma + rs_\gamma + 1$: the total number of parameters in model $\gamma$
- $(X_\gamma,S_\gamma), (\beta_\gamma,\delta_\gamma)$: the corresponding submatrices of $(X,S)$ and subvectors of $(\beta,\delta)$
- $(X_{o,\gamma},S_{o,\gamma}), (X_{c,\gamma}, S_{c,\gamma})$: submatrices of $(X_o,S_o)$ and $(X_c, S_c)$.
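A small bookkeeping sketch for this notation (function and variable names are mine): given a configuration $\gamma\in\{0,1,2\}^p$, it computes $p_\gamma$, $s_\gamma$, $d_\gamma$ and assembles the submatrices of the corresponding model:

```python
import numpy as np

def model_design(gamma, X, S_blocks, r):
    """Assemble (X_gamma, S_gamma) and count parameters for configuration gamma.

    gamma[j] = 0: excluded; 1: linear only; 2: linear + non-linear.
    X: (n, p) covariate matrix; S_blocks: list of p arrays of shape (n, r).
    """
    gamma = np.asarray(gamma)
    p_g = int(np.sum(gamma != 0))   # p_gamma: number of active variables
    s_g = int(np.sum(gamma == 2))   # s_gamma: number with non-linear effects
    d_g = p_g + r * s_g + 1         # d_gamma, with the +1 as in the definition above
    X_g = X[:, gamma != 0]
    S_g = (np.column_stack([S_blocks[j] for j in np.flatnonzero(gamma == 2)])
           if s_g else np.empty((X.shape[0], 0)))
    return X_g, S_g, p_g, s_g, d_g
```

For example, with $p=3$ and $\gamma=(1,2,0)$, $X_\gamma$ keeps two columns, $S_\gamma$ keeps the $r$ spline columns of covariate 2, and $d_\gamma = 2 + r + 1$.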
A standard BMS setting is based on posterior model probabilities
\[p(\gamma\mid y) = \frac{p(y\mid \gamma)p(\gamma)}{\sum_{\gamma'} p(y\mid\gamma')p(\gamma')} = \left(1+\sum_{\gamma'\neq \gamma}B_{\gamma',\gamma}\frac{p(\gamma')}{p(\gamma)}\right)^{-1}\,,\]where $B_{\gamma',\gamma}=p(y\mid\gamma')/p(y\mid\gamma)$ is the Bayes factor between $\gamma'$ and $\gamma$. Two standard variable selection strategies are choosing the model with highest $p(\gamma\mid y)$, or including variables based on the marginal posterior probabilities $P(\gamma_j\neq 0\mid y)$. A key question is how to guarantee sparsity.
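The identity above is easy to sketch numerically. Below is a minimal, hypothetical illustration (not the paper's computation): the log marginal likelihoods $\log p(y\mid\gamma)$ are random placeholders standing in for values that would come from the chosen priors, and the model prior is uniform. It computes $p(\gamma\mid y)$ with the log-sum-exp trick, the highest posterior probability model, and the marginal inclusion probabilities $P(\gamma_j\neq 0\mid y)$:

```python
import itertools
import numpy as np
from scipy.special import logsumexp

p = 3                                                   # toy number of covariates
models = list(itertools.product([0, 1, 2], repeat=p))   # all gamma in {0,1,2}^p

rng = np.random.default_rng(0)
log_marg = rng.normal(size=len(models))   # placeholder for log p(y | gamma)
log_prior = np.full(len(models), -np.log(len(models)))  # uniform p(gamma)

log_post = log_marg + log_prior
log_post -= logsumexp(log_post)           # normalize to get p(gamma | y)
post = np.exp(log_post)

best = models[int(np.argmax(post))]       # highest posterior probability model
incl = np.array([sum(post[m] for m, g in enumerate(models) if g[j] != 0)
                 for j in range(p)])      # P(gamma_j != 0 | y)
print(best, incl.round(3))
```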