CountSplit for scRNA Data

Posted on Dec 08, 2023 (Update: Dec 30, 2023)

Tags: False Discovery Rate, Differential Expression, Selective Inference, Pseudotime, Poisson, Binomial Thinning, Single-cell

The post is for Neufeld, Anna, Lucy L Gao, Joshua Popp, Alexis Battle, and Daniela Witten. “Inference after Latent Variable Estimation for Single-cell RNA Sequencing Data.” Biostatistics, December 13, 2022, kxac047.

data matrix: $X$ has dimension $n\times p$
$X$ is a realization of a random variable $\bfX$
the biological variation in $\bbE[\bfX]$ is explained by a set of latent variables $L\in \IR^{n\times k}$

Goal: which columns $\bfX_j$ of $\bfX$ are associated with $L$.

As $L$ is observed, here is a natural two-step procedure, which is also called double dipping:

latent variable estimation: use $X$ to compute $\hat L(X)$, an estimate of $L$
differential expression analysis: for $j=1,\ldots, p$, test for association between $\bfX_j$ and the columns of $\hat L(X)$

Assume that the entries in $X$ are independent, with

\[E[X_{ij}] = \gamma_j\Lambda_{ij}, \quad\log(\Lambda_{ij}) = \beta_{0j} + \beta_{1j}^TL_i,\quad i=1,\ldots, n, j = 1,\ldots, p\,,\]

where $\gamma_1,\ldots, \gamma_n$ are cell-specific size factors that reflect technical variation in capture efficiency of the mRNA molecules between cells, the $n\times p$ matrix $\Lambda$ represent biological variation, and the matrix $L\in \IR^{n\times k}$ contains the unobserved latent variables.

For any $Z\in \IR^{n\times k}$ and random variable $\bfX\in \IR^{n\times p}$, we define the population parameters targeted by fitting a GLM with a log link function to predict $X_j$ (draw from $\bfX_j$) using $Z$, with $\gamma_i$ included as offsets, to be

\[(\beta_0(Z,\bfX_j), \beta_1(Z,\bfX_j)) = \argmax_{\alpha_0, \alpha_1} \left( \bbE_{\bfX_{1j}, \ldots, \bfX_{nj}}\left[\sum_{i=1}^n\log\left(p_{\gamma_i\exp(\alpha_0+\alpha_1^TZ_i)}(\bfX_{ij}) \right) \right] \right)\]

we say that the $j$-th gene is differentially expressed across a variable $Z$ if $\beta_1(Z,\bfX_j) \neq 0$.

denote the coefficient estimates that result from fitting a GLM with a log link function to predict a realized $X_j$ using $Z$, with the $\gamma_i$ included as offsets, as

\[(\hat\beta_0(Z, X_j), \hat\beta_1(Z, X_j)) = \argmax_{\alpha_0, \alpha_1}\sum_{i=1}^n \log\left(p_{\gamma_i\exp(\alpha_0+\alpha_1^TZ_i)}(X_{ij})\right)\]

For the majority of the paper, assume the model

\[X_{ij} \sim_{iid} \text{Poisson}(\gamma_i\Lambda_{ij})\]

double dipping method: using the same data fro latent variable estimation and differential expression analysis without correcting for this double use. The resulting p-value is

\[\Pr_{H_0: \beta_1(\hat L(X), \bfX_j) = 0} (\vert \hat\beta_1(\hat L(X), \bfX_j) \ge \vert \hat\beta_1(\hat L(X), X_j) \vert \vert)\]

NB: $\hat \beta_1(\hat L(X), X_j)$ is drawn from $\hat\beta_1(\hat L(\bfX), \bfX_j)$, not $\hat\beta_1(\hat L(X), \bfX_j)$.

Count Splitting

For a constant $\epsilon$ with $0 < \epsilon < 1$, draw $\bfX_{ij}^{train}\mid \bfX_{ij} = X_{ij} \sim_{ind.} Binomial(X_{ij}, \epsilon)$, and let $X^{test} = X - X^{train}$.

If $\bfX_{ij}\sim Poisson(\gamma_i\Lambda_{ij})$, then $\bfX_{ij}^{train}$ and $\bfX_{ij}^{test}$ are independent. Furthermore, $\bfX_{ij}^{train}\sim Poisson(\epsilon \gamma_i\Lambda_{ij})$ and $X_{ij}^{test}\sim Poisson((1-\epsilon)\gamma_i\Lambda_{ij})$.

The parameter $\epsilon$ governs a tradeoff between the information available for estimating $L$ and the information available for carrying out inference.

If $\bfX_{ij} \sim Poisson(\gamma_i\Lambda_{ij})$, then $Cor(\bfX_{ij}, \bfX_{ij}^{train})=\sqrt{\epsilon}$.

Thus, as $\epsilon$ decreases, $\hat L(X^{train})$ and $\hat L(X)$ to look similar. This is a drawback, as scientists would ideally like to estimate $L$ using all of the data. However, as $\epsilon$ increases, the power to reject false null hypotheses decreases.

If $\bfX_{ij}\sim_{ind} Poisson(\gamma_i\exp(\beta_{0j}+\beta_{1j}L_i))$. Then $Var(\hat \beta_1(L, \bfX_j^{test}))\approx \frac{1}{1-\epsilon}\Var(\hat\beta_1(L, \bfX_j))$.

In the ideal setting where $\hat L(X^{train})=L$ and $L\in \IR^{n\times 1}$ for simplicity, using $\bfX_j^{test}$ rather than $\bfX_j$ as the response inflates the variance of the estimated coefficient by a factor of $1/(1-\epsilon)$.

Published in categories Note

← previous next →

See all posts →

WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

CountSplit for scRNA Data

Posted on Dec 08, 2023 (Update: Dec 30, 2023)

Count Splitting