scDesign3: A Single-cell Simulator
This note is based on Jingyi Jessica Li’s talk on Song, D., Wang, Q., Yan, G., Liu, T., & Li, J. J. (2022). A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics (p. 2022.09.20.508796). bioRxiv.
A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics
Computational challenges in the single-cell and spatial omics field
- method benchmarking
- data interpretation
- in silico data generation
The paper proposes an all-in-one statistical simulator, scDesign3, that generates realistic single-cell and spatial omics data by learning interpretable parameters from real datasets. The generated data can include
- various cell states
- experimental designs
- feature modalities
Furthermore, using a unified probabilistic model for single-cell and spatial omics data, scDesign3 can
- infer biologically meaningful parameters
- assess the quality of cell clusters and trajectories
- generate in silico negative and positive controls for benchmarking computational tools
single-cell technologies:
- single-cell RNA-seq (scRNA-seq): enable the measurement of transcriptome-wide gene expression levels and the discovery of novel cell types and continuous cell trajectories
- single-cell chromatin accessibility (scATAC-seq and sci-ATAC-seq); see https://bioconductor.org/packages/release/data/experiment/vignettes/scATAC.Explorer/inst/doc/scATAC.Explorer.html for an illustration of scATAC-seq data
- single-cell DNA methylation
- single-cell protein abundance
more recently, single-cell multi-omics technologies simultaneously measure more than one modality:
- SNARE-seq: gene expression and chromatin accessibility
- CITE-seq: gene expression and surface protein abundance
spatial transcriptomics technologies: profile gene expression levels together with spatial location information at the resolution of
- cell neighborhoods
- individual cells
- sub-cellular components
fair benchmarking relies on comprehensive evaluation metrics that reflect real data analytical goals, but meaningful metrics usually require ground truths that are rarely available in real data
- an example: most real datasets contain “cell types” obtained by cell clustering and manual annotation without external validation; using such “cell types” as ground truths would bias the benchmark in favor of the clustering method used in the original study
scDesign3 can generate realistic synthetic data from diverse settings, including
- cell latent structures (discrete cell types and continuous cell trajectories)
- feature modalities (gene expression, chromatin accessibility, methylation, protein abundance, and multi-omics)
- spatial coordinates
- experimental designs (batches and conditions)
scDesign2 is a special case of scDesign3 for generating scRNA-seq data from discrete cell types
verify scDesign3 as a realistic and versatile simulator in four exemplar settings
- scRNA-seq data of continuous cell trajectories
- spatial transcriptomics data
- single-cell epigenomics data
- single-cell multi-omics data
under each setting, show that the synthetic data resemble the test data
- mLISI (mean Local Inverse Simpson’s Index): indicates the degree of similarity between synthetic and real cells, and has a perfect value of 2.
- $r$: Pearson correlation coefficient
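To see why a perfect mixing score is 2, here is a minimal Python sketch of a simplified local inverse Simpson's index. It is an illustration only, not the metric implementation used in the paper: it assumes plain k-nearest-neighbor neighborhoods with equal weights (no kernel weighting), and the function names `neighborhood_diversity` and `mean_diversity` are hypothetical. With two labels (real and synthetic), each neighborhood scores between 1 (one label only) and 2 (an even split).

```python
import random


def neighborhood_diversity(points, labels, k):
    """Inverse Simpson's index of the labels among each point's k nearest neighbors."""
    scores = []
    for i, point in enumerate(points):
        # Squared Euclidean distance from this point to every other point.
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(point, other)), j)
            for j, other in enumerate(points)
            if j != i
        )
        neighbor_labels = [labels[j] for _, j in distances[:k]]
        # Simpson's index: chance that two neighbors drawn with replacement share a label.
        proportions = [neighbor_labels.count(lab) / k for lab in set(neighbor_labels)]
        scores.append(1.0 / sum(p * p for p in proportions))
    return scores


def mean_diversity(points, labels, k=20):
    """Average the per-cell scores (assumption: unweighted kNN neighborhoods)."""
    scores = neighborhood_diversity(points, labels, k)
    return sum(scores) / len(scores)


rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
synthetic = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
labels = ["real"] * 200 + ["synthetic"] * 200

# Same distribution: neighborhoods are roughly half real, half synthetic.
well_mixed = mean_diversity(real + synthetic, labels)

# Synthetic cells shifted far away: every neighborhood contains a single label.
shifted = [(x + 50.0, y) for x, y in synthetic]
separated = mean_diversity(real + shifted, labels)

print(f"well mixed: {well_mixed:.2f}, separated: {separated:.2f}")
```

In the separated case every neighborhood holds a single label, so the score is 1; the well-mixed case approaches, but in finite samples does not quite reach, the ideal value of 2.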
Marginal Distributions: GAMLSS
- $\bfY_{n\times m}$: cell-by-feature matrix
- $\bfX_{n\times p}$: cell-by-state covariate matrix
- $\bfZ_{n\times q}$: cell-by-design covariate matrix. Consider $\bfZ=[\bfb, \bfc]$, where $\bfb$ represents the batch and $\bfc$ represents the condition of each cell.
The marginal distribution $F_j(\cdot\mid \bfx_i,\bfz_i)$ of feature $j$ is fitted by `gamlss::gamlss()` or `mgcv::gam()`.
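The core idea of the marginal model is that each feature's distribution parameters (not only the mean) depend on the cell's covariates. The Python sketch below illustrates this for one feature and a single discrete covariate (cell type). It is not the GAMLSS fit itself: it swaps in a method-of-moments negative binomial estimate per covariate level, and `fit_nb_by_group` and the toy counts are hypothetical.

```python
from statistics import mean, variance


def fit_nb_by_group(counts, groups):
    """Method-of-moments negative binomial fit for each covariate level.

    Returns {group: (mu, size)} under the parameterization Var = mu + mu^2 / size.
    size = inf means no overdispersion was detected (Poisson limit).
    """
    params = {}
    for g in sorted(set(groups)):
        y = [c for c, label in zip(counts, groups) if label == g]
        mu = mean(y)
        var = variance(y)  # sample variance
        size = mu ** 2 / (var - mu) if var > mu else float("inf")
        params[g] = (mu, size)
    return params


# Toy counts for one gene in two cell types (made-up numbers for illustration).
counts = [0, 1, 0, 3, 1, 0, 12, 30, 7, 22, 15, 40]
groups = ["A"] * 6 + ["B"] * 6

params = fit_nb_by_group(counts, groups)
for g, (mu, size) in params.items():
    print(f"cell type {g}: mean = {mu:.2f}, dispersion size = {size:.2f}")
```

Both the mean and the dispersion differ between the two cell types here, which is the kind of covariate dependence a location-scale-shape model is designed to capture; continuous covariates such as pseudotime would need smooth terms instead of group-wise estimates.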
Joint Distributions: Copula
The joint cumulative distribution function $F(\cdot\mid \bfx_i,\bfz_i)$ is modeled from the marginal CDFs $F_j(\cdot\mid \bfx_i,\bfz_i)$ using a copula $C(\cdot\mid \bfx_i,\bfz_i): [0,1]^m\rightarrow [0, 1]$:
\[F(\bfy_i\mid \bfx_i,\bfz_i) = C(F_1(y_{i1}\mid \bfx_i,\bfz_i),\ldots, F_m(y_{im}\mid \bfx_i,\bfz_i)\mid \bfx_i,\bfz_i)\]
The copula can be
- Gaussian copula
- vine copula: decomposes a high-dimensional copula into a sequence of low-dimensional (bivariate) copulas
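The copula construction can be sketched in a few lines of Python for the Gaussian case with $m = 2$ features. This is a simplified illustration under stated assumptions, not scDesign3's implementation: the marginals are taken to be Poisson with fixed means (no covariates), and `poisson_quantile`, `sample_gaussian_copula`, and `pearson` are hypothetical helper names. The steps are: draw correlated standard normals, map them to uniforms with the normal CDF, then push each uniform through the inverse marginal CDF.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam, max_k=100_000):
    """Inverse Poisson CDF: smallest k with P(Y <= k) >= u."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1
    k, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def sample_gaussian_copula(n, rho, lam1, lam2, seed=0):
    """Draw n count pairs with Poisson marginals and Gaussian-copula dependence."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    pairs = []
    for _ in range(n):
        # Correlated standard normals via the Cholesky factor of [[1, rho], [rho, 1]].
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
        # Normal CDF turns each coordinate into a Uniform(0, 1) variable.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Inverse marginal CDFs turn the uniforms into counts.
        pairs.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return pairs


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


pairs = sample_gaussian_copula(n=2000, rho=0.8, lam1=5.0, lam2=20.0)
gene1, gene2 = zip(*pairs)
print(f"means: {sum(gene1) / 2000:.2f}, {sum(gene2) / 2000:.2f}")
print(f"Pearson r: {pearson(gene1, gene2):.2f}")
```

Because the dependence lives entirely in the copula, the marginal means stay near their specified values whatever `rho` is, while the correlation between the two simulated features tracks `rho`.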