scDesign3: A Single-cell Simulator

Posted on Oct 10, 2022

Tags: Single-cell, Copula, Generalized Additive Model

This note is based on Jingyi Jessica Li’s talk on Song, D., Wang, Q., Yan, G., Liu, T., & Li, J. J. (2022). A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics (p. 2022.09.20.508796). bioRxiv.

A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics

Computational challenges in the single-cell and spatial omics field

method benchmarking
data interpretation
in silico data generation

Propose an all-in-one statistical simulator, scDesign3, to generate realistic single-cell and spatial omics data by learning interpretable parameters from real datasets

various cell states
experimental designs
feature modalities

Furthermore, using a unified probabilistic model for single-cell and spatial omics data, scDesign3 can

infer biologically meaningful parameters
assess the quality of cell clusters and trajectories
generate in silico negative and positive controls for benchmarking computational tools

single-cell technologies:

single-cell RNA-seq (scRNA-seq): enable the measurement of transcriptome-wide gene expression levels and the discovery of novel cell types and continuous cell trajectories
single-cell chromatin accessibility (scATAC-seq and sci-ATAC-seq); see https://bioconductor.org/packages/release/data/experiment/vignettes/scATAC.Explorer/inst/doc/scATAC.Explorer.html for an illustration of scATAC-seq data
single-cell DNA methylation
single-cell protein abundance

more recently, single-cell multi-omics technologies measure more than one modality simultaneously:

SNARE-seq: gene expression and chromatin accessibility
CITE-seq: gene expression and surface protein abundance

spatial transcriptomics technologies: profile gene expression levels with spatial location information of

cell neighborhoods
individual cells
sub-cellular components

fair benchmarking relies on comprehensive evaluation metrics that reflect real data analytical goals, but meaningful metrics usually require ground truths that are rarely available in real data

an example: most real datasets contain “cell types” obtained by cell clustering and manual annotation without external validation; using such “cell types” as ground truths would bias the benchmark in favor of the clustering method used in the original study

scDesign3 can generate realistic synthetic data from diverse settings, including

cell latent structures (discrete cell types and continuous cell trajectories)
feature modalities (gene expression, chromatin accessibility, methylation, protein abundance, and multi-omics)
spatial coordinates
experimental designs (batches and conditions)

scDesign2 is a special case of scDesign3 for generating scRNA-seq data from discrete cell types

verify scDesign3 as a realistic and versatile simulator in four exemplar settings

scRNA-seq data of continuous cell trajectories
spatial transcriptomics data
single-cell epigenomics data
single-cell multi-omics data

under each setting, show that the synthetic data resemble the test data

mLISI (mean Local Inverse Simpson’s Index): indicates the degree of similarity between synthetic and real cells, and has a perfect value of 2.
r: Pearson correlation coefficients
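
The value 2 comes from the inverse Simpson's index: with two groups (real and synthetic) in proportions $p_1, p_2$ within a cell's neighborhood, the index is $1/(p_1^2+p_2^2)$, which equals 2 under perfect mixing ($p_1=p_2=1/2$) and 1 when only one group is present. Below is a minimal Python sketch of this idea. It assumes an unweighted $k$-nearest-neighbor neighborhood that includes the cell itself, whereas the original LISI uses distance-based weights governed by a perplexity parameter, so this is a simplification rather than the metric used in the paper. The function name `mean_knn_lisi` is hypothetical.

```python
import math
from collections import Counter


def mean_knn_lisi(points, labels, k):
    """Mean inverse Simpson's index of labels over each point's
    k nearest neighbors (the point itself counts as a neighbor)."""
    scores = []
    for p in points:
        # Rank all points by Euclidean distance to p (p itself is at distance 0).
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        counts = Counter(labels[j] for j in order[:k])
        # Simpson's index: probability that two random neighbors share a label.
        simpson = sum((c / k) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / len(scores)


# Two tight clusters of four cells each, far apart from one another.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1)]

# Real and synthetic cells mixed within each cluster.
mixed = ["real", "synthetic", "real", "synthetic"] * 2
# Real and synthetic cells occupy different clusters.
separated = ["real"] * 4 + ["synthetic"] * 4

print(mean_knn_lisi(points, mixed, k=4))      # 2.0
print(mean_knn_lisi(points, separated, k=4))  # 1.0
```

In the mixed case every neighborhood holds two real and two synthetic cells, so each score is $1/(0.25+0.25)=2$; in the separated case every neighborhood holds a single group, so each score is 1.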

Marginal Distributions: GAMLSS

$\bfY_{n\times m}$: cell-by-feature matrix
$\bfX_{n\times p}$: cell-by-state covariate matrix
$\bfZ_{n\times q}$: cell-by-design covariate matrix. Consider $\bfZ=[\bfb, \bfc]$, where $\bfb$ represents the batch and $\bfc$ the condition of each cell.

The marginal distribution $F_j$ of feature $j$ (given $\bfx_i$ and $\bfz_i$) is fitted by gamlss::gamlss() or mgcv::gam().
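
For orientation, a GAMLSS-type marginal model for a count feature could be written as follows. This is a sketch under assumptions that are not stated in this note: a two-parameter $F_j$ (e.g., negative binomial) with mean $\mu_{ij}$ and dispersion $\sigma_{ij}$, log links, and the symbols $\alpha$, $\beta$, $f$, $g$, $b_i$, $c_i$, all of which are introduced here for illustration only. The paper gives the exact specification.

\[Y_{ij}\mid \bfx_i,\bfz_i \sim F_j(\cdot\mid \bfx_i,\bfz_i;\ \mu_{ij},\sigma_{ij})\]

\[\log \mu_{ij} = \alpha_{j0} + \alpha_{jb_i} + \alpha_{jc_i} + f_{jc_i}(\bfx_i), \qquad \log \sigma_{ij} = \beta_{j0} + \beta_{jb_i} + \beta_{jc_i} + g_{jc_i}(\bfx_i)\]

Here $b_i$ and $c_i$ denote the batch and condition of cell $i$ (the $i$-th entries of $\bfb$ and $\bfc$), and $f_{jc_i}$, $g_{jc_i}$ are smooth functions of the cell-state covariates, for example splines of pseudotime. The point of the "location, scale and shape" formulation is that the covariates may affect not only the mean of feature $j$ but also its dispersion.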

Joint Distributions: Copula

The joint cumulative distribution function $F(\cdot\mid \bfx_i,\bfz_i)$ is built from the marginal CDFs $F_j(\cdot\mid \bfx_i,\bfz_i)$ through a copula $C(\cdot\mid \bfx_i,\bfz_i): [0,1]^m\rightarrow [0, 1]$:

\[F(\bfy_i\mid \bfx_i,\bfz_i) = C(F_1(y_{i1}\mid \bfx_i,\bfz_i),\ldots, F_m(y_{im}\mid \bfx_i,\bfz_i)\mid \bfx_i,\bfz_i)\]

The copula can be

Gaussian copula

vine copula: decomposes a high-dimensional copula into a sequence of low-dimensional (bivariate) copulas
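
To make the Gaussian copula concrete, here is a minimal Python sketch of sampling two correlated count features. It is an illustration under stated assumptions, not scDesign3's implementation: there are only two features, the marginals are Poisson with fixed means (no covariates $\bfx_i$, $\bfz_i$), and the names `poisson_quantile`, `sample_gaussian_copula_poisson`, and `pearson` are hypothetical. The recipe is: draw correlated standard normals, map them to uniforms with the normal CDF, then map the uniforms to counts with each marginal's quantile function.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam):
    """Smallest count k whose Poisson(lam) CDF is at least u.
    Assumes a moderate lam, so that exp(-lam) does not underflow."""
    u = min(u, 1.0 - 1e-12)  # guard: no finite k reaches a CDF of exactly 1
    k = 0
    pmf = math.exp(-lam)     # P(Y = 0)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k       # P(Y = k) obtained from P(Y = k - 1)
        cdf += pmf
    return k


def sample_gaussian_copula_poisson(n, rho, lam1, lam2, seed=0):
    """Draw n pairs of counts with Poisson marginals and a Gaussian copula
    whose latent correlation is rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # Correlated standard normals (2x2 Cholesky factor written out).
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
        # Normal CDF turns them into dependent uniforms on [0, 1].
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Marginal quantile functions turn the uniforms into counts.
        samples.append((poisson_quantile(u1, lam1), poisson_quantile(u2, lam2)))
    return samples


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


draws = sample_gaussian_copula_poisson(n=2000, rho=0.8, lam1=5.0, lam2=3.0)
y1, y2 = zip(*draws)
print(pearson(y1, y2))  # positive, because the latent correlation rho is positive
```

The marginals and the dependence are specified separately: changing `lam1` or `lam2` alters each feature's distribution without touching `rho`, which is the property that lets the marginal models and the copula be fitted in two stages.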

Published in categories Note


WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.
