# sctransform: Normalization using Regularized Negative Binomial Regression


scRNA-seq data exhibits significant cell-to-cell variation due to technical factors, including

- the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects

to address this issue, the paper presents a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments

- propose that the Pearson residuals from “regularized negative binomial regression”, where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analysis while preserving biological heterogeneity
- show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundance to obtain stable parameter estimates
- remove the need for heuristic steps such as pseudocount addition or log-transformation, and improve common downstream analytical tasks such as variable gene selection, dimensional reduction and differential expression

## Introduction

- while unsupervised analysis of single-cell data has transformative potential to uncover heterogeneous cell types and states, cell-to-cell variation in technical factors can also confound these results
- the observed sequencing depth (number of genes or molecules detected per cell) can vary significantly between cells, with variation in molecular counts potentially spanning an order of magnitude, even within the same cell type

an effective normalization workflow should have the following characteristics

- the normalized expression level of a gene should not be correlated with the total sequencing depth of a cell. Downstream analytical tasks (dimensional reduction, differential expression) should also not be influenced by variation in sequencing depth
- the variance of a normalized gene (across cells) should primarily reflect biological heterogeneity, independent of gene abundance or sequencing depth
- genes with high variance after normalization should be differentially expressed across cell types, while housekeeping genes should exhibit low variance.
- the variance of a gene should be similar when considering either deeply sequenced cells, or shallowly sequenced cells

two distinct sets of approaches for the normalization

- the first set aims to identify “size factors” for individual cells, as for bulk RNA-seq
  - BASiCS
  - Scran: pools cells with similar library sizes and uses the summed expression values to estimate pool-based size factors
- the second set models molecule counts using probabilistic approaches
  - initial strategies focused on read-level (instead of UMI-level) data and modeled the measurement of each cell as a mixture of two components: a negative binomial (NB) “signal” component and a Poisson “dropout” component
  - for newer measurements based on UMI, modeling strategies have focused primarily on the use of the NB distribution, potentially including an additional parameter to model zero-inflation (ZINB)
  - ZINB-WaVE models counts as ZINB in a special variant of factor analysis.
  - scVI and DCA also use the ZINB noise model, either for normalization and dimensionality reduction in Bayesian hierarchical models or for a denoising autoencoder.

this paper presents a novel statistical approach for the modeling, normalization, and variance stabilization of UMI count data for scRNA-seq

- first show that different groups of genes cannot be normalized by the same constant factor
- instead propose to construct a generalized linear model (GLM) for each gene with UMI counts as the response and sequencing depth as the explanatory variable
- explore potential error models for the GLM and find that the use of unconstrained NB or ZINB models leads to overfitting of scRNA-seq data and a significant dampening of biological variance

by pooling information across genes with similar abundances, the authors regularize the parameter estimates and obtain reproducible error models

## Methods

### Regularized negative binomial regression

explicitly model the UMI counts for a given gene using a GLM.

use the sum of all molecules assigned to a cell as a proxy for sequencing depth and use this cell attribute in a regression model with negative binomial (NB) error distribution and log link function

thus, for a given gene $i$, we have

\[\log(\mathbb{E}[x_i]) = \beta_0 + \beta_1\log_{10} m\,,\]

where

- $x_i$ is the vector of UMI counts assigned to gene $i$
- $m$ is the vector of molecules assigned to the cells, i.e., $m_j = \sum_ix_{ij}$
- the dispersion parameter $\theta$ of the underlying NB distribution is also unknown and needs to be estimated from the data
- use the NB parametrization with mean $\mu$ and variance given as $\mu + \mu^2/\theta$
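the model above can be made concrete with a minimal Python sketch. It computes the per-cell depth $m_j$, the model mean $\exp(\beta_0 + \beta_1\log_{10} m_j)$, and the NB variance $\mu + \mu^2/\theta$. The function names, the toy count matrix, and the coefficient values are assumptions chosen for illustration; they are not fitted values and not part of the sctransform package.

```python
import math

def depth_per_cell(counts):
    """Sum UMI counts over genes for each cell: m_j = sum_i x_ij.

    `counts` is a genes x cells matrix given as a list of rows.
    """
    n_cells = len(counts[0])
    return [sum(row[j] for row in counts) for j in range(n_cells)]

def nb_mean(beta0, beta1, m):
    """Expected count under the log-link model: exp(beta0 + beta1 * log10(m))."""
    return math.exp(beta0 + beta1 * math.log10(m))

def nb_variance(mu, theta):
    """NB variance in the (mu, theta) parametrization: mu + mu^2 / theta."""
    return mu + mu * mu / theta

# Toy genes x cells matrix (made-up numbers, 3 genes and 4 cells).
counts = [
    [0, 2, 1, 5],
    [3, 8, 4, 20],
    [7, 10, 5, 75],
]
depth = depth_per_cell(counts)

# Illustrative coefficients for one gene (assumed, not fitted).
beta0, beta1, theta = -2.0, 2.0, 10.0
mu = [nb_mean(beta0, beta1, m) for m in depth]
var = [nb_variance(u, theta) for u in mu]
```

with a positive slope $\beta_1$, deeper cells get a larger expected count, and the variance grows faster than the mean whenever $\theta$ is finite.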

use a regression model for the UMI counts to correct for sequencing depth differences between cells and to standardize the data

however, modeling each gene separately results in overfitting, particularly for low-abundance genes that are detected in only a minor subset of cells and are modeled with a high variance

to avoid this overfitting, regularize all model parameters, including the NB dispersion parameter $\theta$, by sharing information across genes

the procedure we developed has three steps

- we fit independent regression models per gene
- exploit the relationship of model parameter values and gene mean to learn global trends in the data
  - capture these trends using a kernel regression estimate (`ksmooth` function in R)
  - use a normal kernel and first select a kernel bandwidth using the R function `bw.SJ`
  - multiply this by a bandwidth adjustment factor (BAF, default value of 3)
  - perform independent regularizations for all parameters
- use the regularized regression parameters to define an affine function that transforms UMI counts into Pearson residuals
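the regularization step can be sketched in Python as a Nadaraya-Watson kernel regression with a normal kernel. This is an illustrative stand-in, not a port of R's `ksmooth`: the bandwidth is used directly as the kernel standard deviation, the data-driven selection done by `bw.SJ` is replaced by a fixed placeholder value, and the per-gene estimates are made-up numbers. Smoothing against the log10 gene mean is also an assumption of this sketch.

```python
import math

def kernel_smooth(x, y, x_out, bandwidth):
    """Nadaraya-Watson estimate of y at each point of x_out (Gaussian kernel).

    `bandwidth` is used directly as the kernel standard deviation.
    """
    fitted = []
    for x0 in x_out:
        weights = [math.exp(-0.5 * ((xi - x0) / bandwidth) ** 2) for xi in x]
        total = sum(weights)
        fitted.append(sum(w * yi for w, yi in zip(weights, y)) / total)
    return fitted

# Hypothetical per-gene estimates: log10 gene mean and a noisy parameter.
log_gene_mean = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0]
theta_raw = [0.5, 3.0, 0.8, 2.5, 1.2, 4.0, 2.0]

base_bw = 0.5  # placeholder for a data-driven bandwidth
baf = 3        # bandwidth adjustment factor
theta_reg = kernel_smooth(log_gene_mean, theta_raw, log_gene_mean, base_bw * baf)
```

each regularized value is a weighted average of the per-gene estimates from genes with similar abundance, so gene-specific noise is damped while the global trend is kept. The same smoother would be applied to each model parameter independently.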

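with the NB mean and variance defined above, the Pearson residuals are

\[z_{ij} = \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}}\,,\qquad \mu_{ij} = \exp(\beta_{0i} + \beta_{1i}\log_{10} m_j)\,,\qquad \sigma_{ij} = \sqrt{\mu_{ij} + \mu_{ij}^2/\theta_i}\,,\]

where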

- $z_{ij}$ is the Pearson residual of gene $i$ in cell $j$
- $x_{ij}$ is the observed UMI count of gene $i$ in cell $j$
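a small Python sketch of the residual computation for a single gene, using the model mean $\exp(\beta_0 + \beta_1\log_{10} m_j)$ and the NB variance $\mu + \mu^2/\theta$ from the regression model. The function name, counts, depths, and parameter values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def pearson_residuals(counts_gene, depth, beta0, beta1, theta):
    """Pearson residuals of one gene across cells under the NB model."""
    residuals = []
    for x, m in zip(counts_gene, depth):
        mu = math.exp(beta0 + beta1 * math.log10(m))  # expected count
        sigma = math.sqrt(mu + mu * mu / theta)       # NB standard deviation
        residuals.append((x - mu) / sigma)
    return residuals

# Three cells with very different depths (made-up values).
depth = [10, 100, 1000]
counts_gene = [1, 9, 40]
z = pearson_residuals(counts_gene, depth, beta0=-2.0, beta1=2.0, theta=10.0)
```

the first cell has exactly the expected count, so its residual is zero. The second cell is above its expectation and the third is below, even though the third has by far the largest raw count. This is the sense in which the residuals remove the depth effect.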

the approach was inspired by methods developed for differential expression analysis in bulk RNA-seq data

- DESeq uses the negative binomial distribution for read count data and links variance and mean by local regression
- DESeq2 extends this approach with Empirical Bayes shrinkage for dispersion estimation
- edgeR introduced GLM algorithms and statistical methods for estimating biological variation on a genewise basis and separating it from technical variation.

### Geometric mean for genes

the regularization approach aims to pool information across genes with similar average expression.

to avoid the influence of outlier cells and respect the exponential nature of the count distributions, consistently use the geometric mean
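a minimal Python sketch of why the geometric mean is less sensitive to outlier cells than the arithmetic mean. Since UMI counts contain zeros, the sketch adds a small offset `eps` before taking logs and subtracts it afterwards; this offset and its value of 1 are assumptions of the example, not something stated in the text above.

```python
import math

def geometric_mean(counts_gene, eps=1.0):
    """exp(mean(log(x + eps))) - eps; eps keeps zero counts finite."""
    logs = [math.log(x + eps) for x in counts_gene]
    return math.exp(sum(logs) / len(logs)) - eps

def arithmetic_mean(counts_gene):
    return sum(counts_gene) / len(counts_gene)

# One outlier cell dominates the arithmetic mean but not the geometric mean.
counts_gene = [0, 1, 1, 3, 99]
g_mean = geometric_mean(counts_gene)
a_mean = arithmetic_mean(counts_gene)
```

here the single cell with 99 counts pulls the arithmetic mean to 20.8, while the geometric mean stays close to the typical counts of the other cells.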