# CountSplit for scRNA Data


- data matrix: $X$ has dimension $n\times p$
- $X$ is a realization of a random variable $\bfX$
- the biological variation in $\bbE[\bfX]$ is explained by a set of latent variables $L\in \IR^{n\times k}$

**Goal**: determine which columns $\bfX_j$ of $\bfX$ are associated with $L$.

As $L$ is unobserved, a natural approach is the following two-step procedure, known as **double dipping**:

- latent variable estimation: use $X$ to compute $\hat L(X)$, an estimate of $L$
- differential expression analysis: for $j=1,\ldots, p$, test for association between $\bfX_j$ and the columns of $\hat L(X)$

Assume that the entries in $X$ are independent, with

\[\bbE[\bfX_{ij}] = \gamma_i\Lambda_{ij}, \quad\log(\Lambda_{ij}) = \beta_{0j} + \beta_{1j}^TL_i,\quad i=1,\ldots, n,\; j = 1,\ldots, p\,,\]

where $\gamma_1,\ldots, \gamma_n$ are cell-specific size factors that reflect technical variation in the capture efficiency of mRNA molecules between cells, the $n\times p$ matrix $\Lambda$ represents biological variation, and the matrix $L\in \IR^{n\times k}$ contains the unobserved latent variables.

For any $Z\in \IR^{n\times k}$ and random variable $\bfX\in \IR^{n\times p}$, we define the population parameters targeted by fitting a GLM with a log link function to predict $X_j$ (drawn from $\bfX_j$) using $Z$, with the $\gamma_i$ included as offsets, to be

\[(\beta_0(Z,\bfX_j), \beta_1(Z,\bfX_j)) = \argmax_{\alpha_0, \alpha_1} \left( \bbE_{\bfX_{1j}, \ldots, \bfX_{nj}}\left[\sum_{i=1}^n\log\left(p_{\gamma_i\exp(\alpha_0+\alpha_1^TZ_i)}(\bfX_{ij}) \right) \right] \right)\,.\]

We say that the $j$-th gene is differentially expressed across a variable $Z$ if $\beta_1(Z,\bfX_j) \neq 0$.

Denote the coefficient estimates that result from fitting a GLM with a log link function to predict a realized $X_j$ using $Z$, with the $\gamma_i$ included as offsets, as

\[(\hat\beta_0(Z, X_j), \hat\beta_1(Z, X_j)) = \argmax_{\alpha_0, \alpha_1}\sum_{i=1}^n \log\left(p_{\gamma_i\exp(\alpha_0+\alpha_1^TZ_i)}(X_{ij})\right)\,.\]

For the majority of the paper, assume the model

\[\bfX_{ij} \sim_{ind.} \text{Poisson}(\gamma_i\Lambda_{ij})\,.\]

The **double dipping** method uses the same data for latent variable estimation and differential expression analysis without correcting for this double use. The resulting p-value is

**NB:** $\hat \beta_1(\hat L(X), X_j)$ is drawn from $\hat\beta_1(\hat L(\bfX), \bfX_j)$, not $\hat\beta_1(\hat L(X), \bfX_j)$.
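To make the problem with double dipping concrete, here is a minimal simulation sketch under a global null (no latent structure at all). It is a toy, not the procedure from the paper: the "latent variable estimate" is a hypothetical two-group split of cells by total count, and a Welch z-test stands in for the GLM Wald test. The helper name `welch_z` and all parameter values are assumptions made for illustration.

```python
import numpy as np


def welch_z(x, groups):
    """Welch z-statistic for each column of x, comparing rows with groups==True vs False."""
    a, b = x[groups], x[~groups]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se


rng = np.random.default_rng(0)
n, p = 2000, 20

# Global null: every gene has the same mean in every cell, so no gene is
# truly associated with any latent variable.
X = rng.poisson(5.0, size=(n, p))

# Step 1 (toy stand-in for latent variable estimation): split cells into two
# groups by whether their total count exceeds the median total count.
totals = X.sum(axis=1)
L_hat = totals > np.median(totals)

# Step 2: test each gene for association with L_hat using the SAME data.
z = welch_z(X, L_hat)
rejection_rate = np.mean(np.abs(z) > 1.96)  # nominal level 0.05

print(f"fraction of null genes rejected at level 0.05: {rejection_rate:.2f}")
```

Because the groups were defined from the same counts that are then tested, the rejection rate lands far above the nominal 5% even though every null hypothesis is true.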

## Count Splitting

For a constant $\epsilon$ with $0 < \epsilon < 1$, draw $\bfX_{ij}^{train}\mid \bfX_{ij} = X_{ij} \sim_{ind.} Binomial(X_{ij}, \epsilon)$, and let $X^{test} = X - X^{train}$.

If $\bfX_{ij}\sim Poisson(\gamma_i\Lambda_{ij})$, then $\bfX_{ij}^{train}$ and $\bfX_{ij}^{test}$ are independent. Furthermore, $\bfX_{ij}^{train}\sim Poisson(\epsilon \gamma_i\Lambda_{ij})$ and $\bfX_{ij}^{test}\sim Poisson((1-\epsilon)\gamma_i\Lambda_{ij})$.
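The splitting step is a single binomial draw per entry. The sketch below uses NumPy; the function name `count_split` and the simulated data (a constant Poisson rate of 4, chosen so that the empirical checks are easy to read) are assumptions for illustration only.

```python
import numpy as np


def count_split(X, epsilon, rng):
    """Binomial thinning: X_train[i, j] ~ Binomial(X[i, j], epsilon), X_test = X - X_train."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    X = np.asarray(X)
    X_train = rng.binomial(X, epsilon)
    X_test = X - X_train
    return X_train, X_test


rng = np.random.default_rng(1)
n, p, epsilon, rate = 5000, 3, 0.3, 4.0

# Simulated Poisson counts with a common rate (an illustrative assumption).
X = rng.poisson(rate, size=(n, p))
X_train, X_test = count_split(X, epsilon, rng)

# Empirical checks against the stated properties, for the first gene.
train_mean = X_train[:, 0].mean()  # should be close to epsilon * rate
test_mean = X_test[:, 0].mean()    # should be close to (1 - epsilon) * rate
corr = np.corrcoef(X_train[:, 0], X_test[:, 0])[0, 1]  # should be close to 0

print(train_mean, test_mean, corr)
```

In practice one would estimate $L$ from `X_train` and run the differential expression tests on `X_test`.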

The parameter $\epsilon$ governs a tradeoff between the information available for estimating $L$ and the information available for carrying out inference.

If $\bfX_{ij} \sim Poisson(\gamma_i\Lambda_{ij})$, then $Cor(\bfX_{ij}, \bfX_{ij}^{train})=\sqrt{\epsilon}$.
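A short derivation of this correlation, using only the independence and the Poisson marginals stated above: write $\lambda = \gamma_i\Lambda_{ij}$ (shorthand introduced here). Since $\bfX_{ij} = \bfX_{ij}^{train} + \bfX_{ij}^{test}$ with independent summands,

\[Cov(\bfX_{ij}, \bfX_{ij}^{train}) = Var(\bfX_{ij}^{train}) = \epsilon\lambda, \qquad Cor(\bfX_{ij}, \bfX_{ij}^{train}) = \frac{\epsilon\lambda}{\sqrt{\lambda\cdot \epsilon\lambda}} = \sqrt{\epsilon}\,.\]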

Thus, as $\epsilon$ decreases, we expect $\hat L(X^{train})$ and $\hat L(X)$ to look less similar. This is a drawback, as scientists would ideally like to estimate $L$ using all of the data. However, as $\epsilon$ increases, the power to reject false null hypotheses decreases.

If $\bfX_{ij}\sim_{ind.} Poisson(\gamma_i\exp(\beta_{0j}+\beta_{1j}L_i))$, then $Var(\hat \beta_1(L, \bfX_j^{test}))\approx \frac{1}{1-\epsilon}Var(\hat\beta_1(L, \bfX_j))$.

In other words, in the ideal setting where $\hat L(X^{train})=L$ (taking $L\in \IR^{n\times 1}$ for simplicity), using $\bfX_j^{test}$ rather than $\bfX_j$ as the response inflates the variance of the estimated coefficient by a factor of $1/(1-\epsilon)$.
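One way to see where the factor $1/(1-\epsilon)$ comes from is the standard large-sample approximation of the MLE variance by the inverse Fisher information. For a Poisson GLM with log link this information is $\sum_i \mu_i z_i z_i^T$ with $z_i = (1, L_i)^T$, and thinning multiplies every mean $\mu_i$ by $1-\epsilon$. The sketch below checks this numerically; the function name `slope_variance` and all parameter values are illustrative assumptions, and this is not necessarily the argument used in the paper.

```python
import numpy as np


def slope_variance(L, gamma, beta0, beta1, scale=1.0):
    """Asymptotic variance of the slope MLE in a Poisson GLM with log link.

    The mean of observation i is scale * gamma[i] * exp(beta0 + beta1 * L[i]);
    the variance is read off the inverse Fisher information matrix.
    """
    mu = scale * gamma * np.exp(beta0 + beta1 * L)
    Z = np.column_stack([np.ones_like(L), L])  # design matrix: intercept and L
    fisher = Z.T @ (mu[:, None] * Z)           # sum_i mu_i z_i z_i^T
    return np.linalg.inv(fisher)[1, 1]


rng = np.random.default_rng(2)
n, epsilon = 500, 0.3
L = rng.normal(size=n)                 # one-dimensional latent variable
gamma = rng.uniform(0.5, 1.5, size=n)  # illustrative size factors

var_full = slope_variance(L, gamma, beta0=0.5, beta1=0.8)
var_test = slope_variance(L, gamma, beta0=0.5, beta1=0.8, scale=1 - epsilon)
ratio = var_test / var_full

print(ratio, 1 / (1 - epsilon))
```

Scaling all means by $1-\epsilon$ scales the information matrix by the same factor, so its inverse, and hence the approximate variance of the slope, grows by $1/(1-\epsilon)$.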