sctransform: Normalization using Regularized Negative Binomial Regression

Posted on Feb 24, 2024 (Update: Mar 07, 2024)

scRNA-seq data exhibits significant cell-to-cell variation due to technical factors, including

the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects

to address this issue, the paper presents a modeling framework for the normalization and variance stablization of molecular count data from scRNA-seq experiments

propose that the Pearson residuals from “regularized negative binomial regression”, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analysis while preserving biological heterogeneity
show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundance to obtain stable parameter estimates
omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction and differential expression

Introduction

while unsupervised analysis of single-cell data has transformative potential to uncover heterogeneous cell types and states, cell-to-cell variation in technical factors can also confound these results
- the observed sequencing depth (number of genes or molecules detected per cell) can vary significantly between cells, with variation in molecular counts potentially spanning an order of magnitude, even within the same cell type

an effective normalization workflow should have the following characteristics

the normalized expression level of a gene should not be correlated with the total sequencing depth of a cell. Downstream analytical tasks (dimensional reduction, differential expression) should also not be influenced by variation in sequencing depth
the variance of a normalized gene (across cells) should primarily reflect biological heterogeneity, independent of gene abundance or sequencing depth
- genes with high variance after normalization should be differentially expressed across cell types, while housekeeping genes should exhibit low variance.
- the variance of a gene should be similar when considering either deeply sequenced cells, or shallowly sequenced cells

two distinct sets of approaches for the normalization

first set aims to identify “size factors” for individual cells, as for bulk RNA-seq
- BASiCS
- Scran: pool cells with similar library sizes and uses the summed expression values to estimate pool-based size factors
model molecule counts using probabilistic approaches
- initail strategies focused on read-level (instead of UMI-level) data and modeled the measurement of each cell as a mixture of two components: a negative binomial (NB) “signal” component and a Poisson “dropout” component
- for newer measurements based on UMI, modeling strategies have focused primarily on the use of NB distribution, potentially including an additional parameter to model zero-inflation (ZINB)
- ZINB-WaVE models counts as ZINB in a special variant of factor analysis.
- scVI and DCA also use the ZINB noise model, either for normalization and dimensionality reduction in Bayesian hierarchical models or for a denoising autoencoder.

this paper presents a novel statistical approach for the modeling, normalization, and variance stabilization of UMI count data for scRNA-seq

first show that different groups of genes cannot be normalized by the sam constant factor
instead propose to construct a generalized linear model (GLM) for each gene with UMI counts as the response and sequencing depth as the explanatory variable
explore potential error models for the GLM and find that the use of unconstrained NB or ZINB models lead to overfitting of scRNA-seq data and a significant dampening of biological variance

pool information across genes with similar abundances, then they can regularize parameter estimates and obtain reproducible error models

Methods

Regularized negative binomial regression

explicitly model the UMI counts for a given gene using a GLM.

use the sum of all molecules assigned to a cell as a proxy for sequencing depth and use this cell attribute in a regression model with negative binomial (NB) error distribution and log link function

thus, for a given gene $i$, we have

\[\log(\bbE[x_i]) = \beta_0 + \beta_1\log_{10} m\,,\]

where

$x_i$ is the vector of UMI counts assigned to gene $i$
$m$ is the vector of molecules assigned to the cells, i.e., $m_j = \sum_ix_{ij}$
the dispersion parameter $\theta$ of the underlying NB distribution is also unknown and needs to be estimated from the data
- use the NB parametrization with mean $\mu$ and variance given as $\mu + \mu^2/\theta$

use a regression model for the UMI counts to correct for sequencing depth differences between cells and to standardize the data

however, modeling each gene separately results in overfitting, particularly for low-abundance genes that are detected in only a mirror subset of cells and are modeled with a high variance

to avoid this overfitting, regularize all model parameters, including the NB dispersion parameter $\theta$, by sharing information across genes

the procedure we developed has three steps

we fit independent regression models per gene
exploit the relationship of model parameter values and gene mean to learn global trends in the data
1. capture these trends using a kernel regression estimate (ksmooth function in R)
2. use a normal kernel and first select a kernel bandwidth using the R function bw.SJ
3. multiply this by a bandwidth adjustment factor (BAF, default value of 3)
4. perform independent regularizations for all parameters
use the regularized regression parameters to define an affine function that transforms UMI counts into Pearson residuals

\[\begin{align*} z_{ij} &= \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}} \\ \mu_{ij} &= \exp(\beta_{0i}+\beta_{1i}\log_{10}m_j)\\ \sigma_{ij} &= \sqrt{\mu_{ij} + \frac{\mu_{ij}^2}{\theta_i}} \end{align*}\]

where

$z_{ij}$ is the Pearson residual of gene $i$ in cell $j$
$x_{ij}$ is the observed UMI count of gene $i$ in cell $j$

the approach was inspired by methods developed for differential expression analysis in bulk RNA-seq data

DEseq uses the negative binomial distribution for read count data and links variance and mean by local regression
DEseq2 extends this approach with Empirical Bayes shrinkage for dispersion estimation
edgeR introduced GLM algorithms and statistical methods for estimating biological variation on a genewise basis and separating it from technical variation.

Geometric mean for genes

the regularization approach aims to pool information across genes with similar average expression.

to avoid the influence of outlier cells and respect the exponential nature of the count distributions, consistently use the geometric mean

Published in categories Note

← previous next →

See all posts →

WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.