Comparisons of transformations for single-cell RNA-seq data
Posted on
This post is for Ahlmann-Eltze, C., & Huber, W. (2023). Comparison of transformations for single-cell RNA-seq data. Nature Methods, 20(5), 665–672.
For the count table, a numeric matrix of genes x cells, a common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range.
four transformation approaches:
- delta method
- model residuals
- inferred latent expression state
- factor analysis
the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well or better than the more sophisticated alternatives.
Single-cell RNA-seq count tables are heteroskedastic. In particular, counts for highly expressed genes vary more than for lowly expressed genes.
One approach is to explicitly model the sampling distributions.
Alternatively, use variance-stabilizing transformations as a preprocessing step and subsequently use the many existing statistical methods that implicitly or explicitly assume uniform variance.
Instead of working with the raw counts $Y$, we apply a non-linear function $g(Y)$ designed to make the variances (and possibly, higher moments) more similar across the dynamic range of the data
The gamma-Poisson distribution with mean $\mu$ and overdisperion $\alpha$ implies a quadratic mean-variance relationship $Var(Y)=\mu+\alpha\mu^2$.
given such a mean-variance relationship, applying the delta method produces the variance-stabilizing transformation
\[g(y) = \frac{1}{\sqrt\alpha}acosh(2\alpha y+1)\]Practitioners often use the shifted logarithm
\[g(y) = log(y + y_0)\]an additional requirement is posed by experimental variations in sampling efficiency and different cell sizes, which manifest themselves in varying total numbers of UMIs per cell.
Commonly, a so-called size factor $s$ is determined for each cell and the counts are divided by it before applying the variance-stabilizing transformation: for example, $\log(y/s + y_0)$
a variety of approaches to estimate size factors from the data
conventionally, they are scaled to be close to 1 (for example, by dividing them by their mean), such that the range of the adjusted counts is about the same as that of the raw counts.
The simplest estimate of the size factor for cell $c$ is
\[s_c = \frac{\sum_g y_{gc}}{L}\]where the numerator is the total number of UMIs for cell $c, g$ indexes the genes and $L=(no.cells)^{-1}\sum_{gc}y_{gc}$ is the average across all cells of these numerators.
Sometimes, a fixed value is used instead for $L$
- Seurat uses $L=10000$,
- others have used $L=10^6$ calling the resulting values $y_{gc}/s_c$ counts per million (CPM)
Hafemeister and Satija suggested a different approach to variance stabilization based on Pearson residuals
\[r_{gc} = \frac{y_{gc}-\hat\mu_{gc}}{\sqrt{\hat\mu_{gc} + \hat\alpha_g\hat\mu_{gc}^2}}\]where $\hat\mu_{gc}$ and $\hat\alpha_g$ come from fitting a gamma-Poisson GLM,
\[\begin{align*} Y_{gc} &\sim \text{gamma-Poisson}(\mu_{gc}, \alpha_g) \\ \log\mu_{gc} &= \beta_{g, intercept} + \beta_{g, slope}\log s_c\,. \end{align*}\]A third set of transformations infers the parameters of a postulated generative model, aiming to estimate so-called latent gene expression values based on the observed counts
- Sanity: A prominent instance of this approach is Sanity, a fully Bayesian model for gene expression. It infers latent gene expression using a method that resembles a variational mean-field approximation for a log-normal Poisson mixture model.
- Sanity Distance calculates the mean and s.d. of the posterior distribution of the logarithmic gene expression; based on these, it calculates all cell-by-cell distances, from which it can find the k-nearest neighbors of each cell
- Sanity MAP ignores the inferred uncertainty and returns the maximum of the posterior as the transformed value
- Dino: fits mixtures of gamma-Poisson distributions and returns random samples from the posterior
- Normalisr: primarily designed fro frequentist hypothesis testing, but as it infers logarithmic latent gene expression, it might also serve as a generic preprocessing method
- return the minimum mean square error estimate for each count assuming a binomial generative model.
A fourth preprocessing approach that is not transformation-based and directly produces a low-dimensional latent space representation of the cells: factor analysis for count data based on the gamma-Poisson sampling distribution.