spaCRT: saddlepoint approximation-based conditional randomization test
Posted on
spaCRT: saddlepoint approximation-based conditional randomization test
balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens
- resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computation cost
- leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands
prove that the spaCRT p-value approximates the dCRT p-value with vanishing relative error, and that these two tests are asymptotically equivalent
Motivating application: Single-cell CRISPR screens
- it is of interest to determine which regulatory elements control the expressions of which genes
- single-cell CRISPR screens are designed to subject a population of cells to a large number of CRISPR perturbations, each of which inhibits the functioning of a specific regulatory element
- each cell receives several of these perturbations, and the expression of each gene in each cell is measured by single-cell RNA-sequencing
statistical analysis task: determine, for each CRISPR perturbation and gene of interest, whether cells with the perturbation have different gene expression levels compared to cells without the perturbation.
- $X\in {0, 1}$: the presence of a CRISPR perturbation
- $Y\in \bbN$: the gene expression level
- $Z\in \bbR^p$: a set of covariates measured in a cell, respectively. It includes technical factors like library size (the total number of RNA molecules sequenced in a cell) and experiment batch, which may impact both $X$ and $Y$
in the joint distribution $(X, Y, Z) \in L_n$ (potentially depending on the number of cells $n$), the statistical task is to test the null hypothesis
\[H_0: X\ind Y\mid Z\]the statistical task is to test the null hypothesis
\[H_0: X\ind Y\mid Z\]the presence of the CRISPR perturbation is conditionally independent of the gene expression level, given the covariates
collect observations $(X_{in}, Y_{in}, Z_{in})\sim L_n$ for cells $i=1,\ldots, n$
denote these observations collectively as $X\in \IR^n, Y\in \IR^n$
challenge:
- both $X$ and $Y$ are highly sparse
- gene expression data are measured as RNA molecule counts, and when measured at single-cell resolution
testing strategies
working model:
- $X_i\mid Z_i \sim Ber(\pi_i)$
- $Y_i\mid X_i, Z_i \sim NegBin(\mu_i, \theta)$, $\log\mu_i = X_i\beta + Z_i^\top\gamma$
three methods
- score test
- GCM test
- dCRT
key idea: approximate the distribution of the resampled test statistic in the dCRT using a saddlepoint approximation (SPA), a classical technique to obtain highly accurate approximations to densities and tail probabilities for quantities that can be expressed as sample averages
Related work
in dCRT paper, a resampling-free approximation was proposed based on a quantile transformation to a normal distribution, but this approach is primarily useful for continuously distributed $X$, and that it incurs a substantial power loss for discrete $X$
SPAs have been proposed to approximate resampling distributions of other resampling-based procedures, like
- permutation tests
- the bootstrap
dCRT
consider the special case
\[L_n(X\mid Z) = f(X\mid \theta_{n,x}(Z))\]where $f(x\mid\theta)$ is an exponential family with natural parameter $\theta$, natural parameter space $\IR$, and log-partition function $A$:
\[f(x\mid \theta) = \exp(\theta x-A(\theta)) h(x)\]consider estimating the functions $\theta_{n, x}(Z)$ and $\mu_{n, y}(Z) = \bbE_{L_n}[Y\mid Z]$ by $\hat\theta_{n,x}(Z)$ and $\hat\mu_{n, y}(Z)$, respectively
setting $\hat\mu_{n,x}(Z) \equiv A’(\hat\theta_{n,x}(Z))$, we arrive at the test statistic
\[T_n^{dCRT}(X, Y, Z) = \frac{1}{n} \sum_{i=1}^n (X_{in} - \hat \mu_{n,x}(Z_{in}))(Y_{in} - \hat \mu_{n, y}(Z_{in}))\]spaCRT
consider the limit of the dCRT $p$-value as the number of resamples $M$ grows,
\[p_{dCRT} \equiv P\left[T_n^{dCRT}(\tilde X, X, Y, Z) \ge T_n^{dCRT}(X, Y, Z) \mid X, Y, Z \right]\]approximate this conditional tail probability via the SPA