WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

spaCRT: saddlepoint approximation-based conditional randomization test

Posted on
Tags: Single-cell, Saddlepoint Approximation, Conditional Randomization Test

This note is for Niu, Z., Choudhury, J. R., & Katsevich, E. (2024). Computationally efficient and statistically accurate conditional independence testing with spaCRT (No. arXiv:2407.08911; Version 1). arXiv. https://doi.org/10.48550/arXiv.2407.08911

spaCRT: saddlepoint approximation-based conditional randomization test

balances statistical accuracy and computational efficiency, inspired by applications to single-cell CRISPR screens

  • resampling-based methods like the distilled conditional randomization test (dCRT) offer statistical precision but at a high computation cost
  • leverages a saddlepoint approximation to the resampling distribution of the dCRT test statistic, achieving very similar finite-sample statistical performance with significantly reduced computational demands

prove that the spaCRT p-value approximates the dCRT p-value with vanishing relative error, and that these two tests are asymptotically equivalent

Motivating application: Single-cell CRISPR screens

  • it is of interest to determine which regulatory elements control the expressions of which genes
  • single-cell CRISPR screens are designed to subject a population of cells to a large number of CRISPR perturbations, each of which inhibits the functioning of a specific regulatory element
  • each cell receives several of these perturbations, and the expression of each gene in each cell is measured by single-cell RNA-sequencing

statistical analysis task: determine, for each CRISPR perturbation and gene of interest, whether cells with the perturbation have different gene expression levels compared to cells without the perturbation.

  • $X\in {0, 1}$: the presence of a CRISPR perturbation
  • $Y\in \bbN$: the gene expression level
  • $Z\in \bbR^p$: a set of covariates measured in a cell, respectively. It includes technical factors like library size (the total number of RNA molecules sequenced in a cell) and experiment batch, which may impact both $X$ and $Y$

in the joint distribution $(X, Y, Z) \in L_n$ (potentially depending on the number of cells $n$), the statistical task is to test the null hypothesis

\[H_0: X\ind Y\mid Z\]

the statistical task is to test the null hypothesis

\[H_0: X\ind Y\mid Z\]

the presence of the CRISPR perturbation is conditionally independent of the gene expression level, given the covariates

collect observations $(X_{in}, Y_{in}, Z_{in})\sim L_n$ for cells $i=1,\ldots, n$

denote these observations collectively as $X\in \IR^n, Y\in \IR^n$

challenge:

  • both $X$ and $Y$ are highly sparse
  • gene expression data are measured as RNA molecule counts, and when measured at single-cell resolution

testing strategies

working model:

  • $X_i\mid Z_i \sim Ber(\pi_i)$
  • $Y_i\mid X_i, Z_i \sim NegBin(\mu_i, \theta)$, $\log\mu_i = X_i\beta + Z_i^\top\gamma$

three methods

  • score test
  • GCM test
  • dCRT

key idea: approximate the distribution of the resampled test statistic in the dCRT using a saddlepoint approximation (SPA), a classical technique to obtain highly accurate approximations to densities and tail probabilities for quantities that can be expressed as sample averages

in dCRT paper, a resampling-free approximation was proposed based on a quantile transformation to a normal distribution, but this approach is primarily useful for continuously distributed $X$, and that it incurs a substantial power loss for discrete $X$

SPAs have been proposed to approximate resampling distributions of other resampling-based procedures, like

  • permutation tests
  • the bootstrap

dCRT

consider the special case

\[L_n(X\mid Z) = f(X\mid \theta_{n,x}(Z))\]

where $f(x\mid\theta)$ is an exponential family with natural parameter $\theta$, natural parameter space $\IR$, and log-partition function $A$:

\[f(x\mid \theta) = \exp(\theta x-A(\theta)) h(x)\]

consider estimating the functions $\theta_{n, x}(Z)$ and $\mu_{n, y}(Z) = \bbE_{L_n}[Y\mid Z]$ by $\hat\theta_{n,x}(Z)$ and $\hat\mu_{n, y}(Z)$, respectively

setting $\hat\mu_{n,x}(Z) \equiv A’(\hat\theta_{n,x}(Z))$, we arrive at the test statistic

\[T_n^{dCRT}(X, Y, Z) = \frac{1}{n} \sum_{i=1}^n (X_{in} - \hat \mu_{n,x}(Z_{in}))(Y_{in} - \hat \mu_{n, y}(Z_{in}))\]

image

spaCRT

consider the limit of the dCRT $p$-value as the number of resamples $M$ grows,

\[p_{dCRT} \equiv P\left[T_n^{dCRT}(\tilde X, X, Y, Z) \ge T_n^{dCRT}(X, Y, Z) \mid X, Y, Z \right]\]

approximate this conditional tail probability via the SPA

image


Published in categories