WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

scDesign3: A Single-cell Simulator

Tags: Single-cell, Copula, Generalized Additive Model

This note is based on Jingyi Jessica Li’s talk on Song, D., Wang, Q., Yan, G., Liu, T., & Li, J. J. (2022). A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics (p. 2022.09.20.508796). bioRxiv.

A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics

Computational challenges in the single-cell and spatial omics field

  • method benchmarking
  • data interpretation
  • in silico data generation

The authors propose scDesign3, an all-in-one statistical simulator that generates realistic single-cell and spatial omics data by learning interpretable parameters from real datasets, including

  • various cell states
  • experimental designs
  • feature modalities

Furthermore, using a unified probabilistic model for single-cell and spatial omics data, scDesign3 can

  • infer biologically meaningful parameters
  • assess the quality of cell clusters and trajectories
  • generate in silico negative and positive controls for benchmarking computational tools

Single-cell technologies: more recently, single-cell multi-omics technologies simultaneously measure more than one modality, e.g.,

  • SNARE-seq: gene expression and chromatin accessibility
  • CITE-seq: gene expression and surface protein abundance

Spatial transcriptomics technologies: profile gene expression levels together with the spatial locations of

  • cell neighborhoods
  • individual cells
  • sub-cellular components

Fair benchmarking relies on comprehensive evaluation metrics that reflect real data analytical goals, but meaningful metrics usually require ground truths that are rarely available in real data.

  • an example: most real datasets contain “cell types” obtained by cell clustering and manual annotation without external validation; using such “cell types” as ground truths would bias the benchmark in favor of the clustering method used in the original study

scDesign3 can generate realistic synthetic data from diverse settings, including

  • cell latent structures (discrete cell types and continuous cell trajectories)
  • feature modalities (gene expression, chromatin accessibility, methylation, protein abundance, and multi-omics)
  • spatial coordinates
  • experimental designs (batches and conditions)

scDesign2 is a special case of scDesign3 for generating scRNA-seq data from discrete cell types

verify scDesign3 as a realistic and versatile simulator in four exemplar settings

  • scRNA-seq data of continuous cell trajectories
  • spatial transcriptomics data
  • single-cell epigenomics data
  • single-cell multi-omics data


under each setting, show that the synthetic data resemble the test data

  • mLISI (mean Local Inverse Simpson’s Index): indicates the degree of similarity between synthetic and real cells; its perfect value is 2.
  • r: Pearson correlation coefficient
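
To see why a perfect mLISI is 2 when there are two groups (synthetic and real), here is a minimal sketch of the idea. It is an assumption-laden simplification: it uses plain, unweighted k-nearest neighbours on toy 2-D points, whereas the actual LISI weights neighbours with a perplexity-based kernel. The helper names `inverse_simpson` and `mean_lisi` are hypothetical, not part of scDesign3.

```python
import math
import random


def inverse_simpson(labels):
    """Inverse Simpson's index 1 / sum_k p_k^2 for a list of group labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def mean_lisi(points, labels, k=10):
    """Mean over all cells of the inverse Simpson's index of the labels
    among each cell's k nearest neighbours (unweighted simplification)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from cell i to every other cell, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        scores.append(inverse_simpson(neighbour_labels))
    return sum(scores) / len(scores)


# Toy data (made up for illustration): one "real" cloud and two candidate
# "synthetic" clouds, one overlapping the real cells and one far away.
rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
synthetic_good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
synthetic_bad = [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(100)]
labels = ["real"] * 100 + ["synthetic"] * 100

mlisi_good = mean_lisi(real + synthetic_good, labels)  # well mixed: towards 2
mlisi_bad = mean_lisi(real + synthetic_bad, labels)    # separated: towards 1
print(mlisi_good, mlisi_bad)
```

With two equally sized groups, a neighbourhood that is half real and half synthetic gives 1 / (0.5² + 0.5²) = 2, while a neighbourhood containing only one group gives 1.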

Marginal Distributions: GAMLSS

  • $\bfY_{n\times m}$: cell-by-feature matrix
  • $\bfX_{n\times p}$: cell-by-state covariate matrix
  • $\bfZ_{n\times q}$: cell-by-design covariate matrix. Consider $\bfZ=[\bfb, \bfc]$, where $\bfb$ represents the batch and $\bfc$ represents the condition of each cell.


The marginal distribution $F_j$ of each feature $j$ is fitted by gamlss::gamlss() or mgcv::gam().

Joint Distributions: Copula

The joint cumulative distribution function $F(\cdot\mid \bfx_i,\bfz_i)$ is built from the marginal CDFs $F_j(\cdot\mid \bfx_i,\bfz_i)$, $j=1,\ldots,m$, using a copula $C(\cdot\mid \bfx_i,\bfz_i): [0,1]^m\rightarrow [0, 1]$:

\[F(\bfy_i\mid \bfx_i,\bfz_i) = C(F_1(y_{i1}\mid \bfx_i,\bfz_i),\ldots, F_m(y_{im}\mid \bfx_i,\bfz_i)\mid \bfx_i,\bfz_i)\]
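
The equation can be read as a sampling recipe: draw dependent uniforms from the copula, then push each one through the inverse of its marginal CDF. Below is a minimal sketch for $m=2$ features, taking a Gaussian copula as one possible choice. Everything specific here is an assumption for illustration: the Poisson marginals, the fixed means (standing in for fitted marginal parameters at one cell's covariates), the correlation `rho`, and the helper names are not taken from the paper or the scDesign3 package.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, mean):
    """Smallest count k whose Poisson(mean) CDF is at least u."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1.0
    k = 0
    pmf = math.exp(-mean)
    cdf = pmf
    while cdf < u and k < 10_000:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k


def sample_gaussian_copula_pair(rho, means, n, seed=0):
    """Draw n pairs of counts whose dependence comes from a bivariate
    Gaussian copula with correlation rho (assumed Poisson marginals)."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        # Correlated standard normals via the 2x2 Cholesky factor.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Move to the copula scale [0, 1]^2 ...
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # ... then apply each feature's inverse marginal CDF.
        y1 = poisson_quantile(u1, means[0])
        y2 = poisson_quantile(u2, means[1])
        samples.append((y1, y2))
    return samples


counts = sample_gaussian_copula_pair(rho=0.8, means=(5.0, 20.0), n=2000)
print(counts[:5])
```

The marginals and the dependence are specified separately: changing `means` alters each feature's distribution without touching the copula, and changing `rho` alters the dependence without touching the marginals.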

The copula can be

  • Gaussian copula


  • vine copula: decomposes a high-dimensional copula into a sequence of low-dimensional (typically bivariate) copulas
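
To illustrate how a vine copula breaks a joint distribution into low-dimensional pieces, consider a standard textbook case that is not taken from the talk: three continuous features with marginal densities $f_j$ and CDFs $F_j$, with the covariates $\bfx_i,\bfz_i$ suppressed for readability. One possible decomposition of the joint density uses three bivariate copula densities:

\[f(y_1,y_2,y_3) = f_1(y_1)\,f_2(y_2)\,f_3(y_3)\; c_{12}(F_1(y_1),F_2(y_2))\; c_{23}(F_2(y_2),F_3(y_3))\; c_{13\mid 2}(F_{1\mid 2}(y_1\mid y_2),F_{3\mid 2}(y_3\mid y_2))\]

Here $c_{13\mid 2}$ is the copula density of features 1 and 3 conditional on feature 2. In general it may depend on the value $y_2$; a common simplifying assumption is that it does not.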

