# scDesign3: A Single-cell Simulator

This note is based on Jingyi Jessica Li’s talk on Song, D., Wang, Q., Yan, G., Liu, T., & Li, J. J. (2022). A unified framework of realistic in silico data generation and statistical model inference for single-cell and spatial omics (p. 2022.09.20.508796). bioRxiv.

Computational challenges in the single-cell and spatial omics field

- method benchmarking
- data interpretation
- in silico data generation

The paper proposes an all-in-one statistical simulator, scDesign3, that generates realistic single-cell and spatial omics data by learning interpretable parameters from real datasets. The synthetic data can cover

- various cell states
- experimental designs
- feature modalities

Furthermore, using a unified probabilistic model for single-cell and spatial omics data, scDesign3 can

- infer biologically meaningful parameters
- assess the quality of cell clusters and trajectories
- generate in silico negative and positive controls for benchmarking computational tools

single-cell technologies:

- single-cell RNA-seq (scRNA-seq): enables the measurement of transcriptome-wide gene expression levels and the discovery of novel cell types and continuous cell trajectories
- single-cell chromatin accessibility (scATAC-seq and sci-ATAC-seq)
- single-cell DNA methylation
- single-cell protein abundance

more recently, single-cell multi-omics technologies simultaneously measure more than one modality:

- SNARE-seq: gene expression and chromatin accessibility
- CITE-seq: gene expression and surface protein abundance

spatial transcriptomics technologies: profile gene expression levels with spatial location information of

- cell neighborhoods
- individual cells
- sub-cellular components

fair benchmarking relies on comprehensive evaluation metrics that reflect real data analytical goals, but meaningful metrics usually require ground truths that are rarely available in real data

- an example: most real datasets contain “cell types” obtained by cell clustering and manual annotation without external validation; using such “cell types” as ground truths would bias the benchmark in favor of the clustering method used in the original study

scDesign3 can generate realistic synthetic data from diverse settings, including

- cell latent structures (discrete cell types and continuous cell trajectories)
- feature modalities (gene expression, chromatin accessibility, methylation, protein abundance, and multi-omics)
- spatial coordinates
- experimental designs (batches and conditions)

scDesign2 is a special case of scDesign3 for generating scRNA-seq data from discrete cell types

verify scDesign3 as a realistic and versatile simulator in four exemplar settings

- scRNA-seq data of continuous cell trajectories
- spatial transcriptomics data
- single-cell epigenomics data
- single-cell multi-omics data

under each setting, show that the synthetic data resemble the test data

- `mLISI` (mean Local Inverse Simpson’s Index): indicates the degree of similarity between synthetic and real cells, and has a perfect value of 2.
- `r`: Pearson correlation coefficient
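To see why a perfect mLISI is 2, here is a minimal sketch of the two metrics in plain Python. It is a simplification, not the implementation used in the paper: it takes the inverse Simpson's index $1/\sum_k p_k^2$ over each cell's plain `k` nearest neighbors, whereas the published LISI weights neighbors by distance. The function names, the 2-D toy embedding, and `k=10` are assumptions made for illustration.

```python
import math
import random


def inverse_simpson(labels):
    """Inverse Simpson's index, 1 / sum_k p_k^2, of a list of labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def mean_lisi(points, labels, k=10):
    """Average the inverse Simpson's index over each cell's k nearest neighbors.

    Simplified: unweighted neighbors, brute-force Euclidean distances.
    """
    scores = []
    for i, p in enumerate(points):
        # Rank all other cells by distance to cell i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        scores.append(inverse_simpson([labels[j] for j in order[:k]]))
    return sum(scores) / len(scores)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy 2-D "embedding": real cells plus two hypothetical synthetic datasets.
rng = random.Random(0)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
synthetic_good = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
synthetic_bad = [(x + 100.0, y + 100.0) for x, y in synthetic_good]
labels = ["real"] * 100 + ["synthetic"] * 100

# Well-mixed data scores near 2; fully separated data scores 1.
mixed_score = mean_lisi(real + synthetic_good, labels)
separated_score = mean_lisi(real + synthetic_bad, labels)
print(mixed_score, separated_score)
```

With two labels (real and synthetic), a neighborhood that is half real and half synthetic gives $1/(0.5^2 + 0.5^2) = 2$, while a neighborhood containing only one label gives 1.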

## Marginal Distributions: GAMLSS

- $\bfY_{n\times m}$: cell-by-feature matrix
- $\bfX_{n\times p}$: cell-by-state covariate matrix
- $\bfZ_{n\times q}$: cell-by-design covariate matrix. Consider $\bfZ=[\bfb, \bfc]$, where $\bfb$ represents the batch and $\bfc$ represents the condition of each cell.

The marginal CDF $F_j(\cdot\mid \bfx_i,\bfz_i)$ of feature $j$ is fitted by `gamlss::gamlss()` or `mgcv::gam()`.
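As a rough illustration of what a covariate-dependent marginal looks like, the sketch below evaluates $F_j(y\mid x, z)$ for one feature. It rests on several assumptions that are not from the note: a negative binomial marginal, a log-link mean with a linear term in a one-dimensional cell state `x` (for example, pseudotime) standing in for a smooth function, and an additive batch effect. All function names and coefficient values are hypothetical and hand-picked. Nothing is fitted here; actual fitting would go through the R functions named above.

```python
import math


def nb_pmf(y, mu, size):
    """Negative binomial pmf with mean mu and size (inverse dispersion)."""
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def nb_cdf(y, mu, size):
    """P(Y <= y) for a negative binomial count Y."""
    return sum(nb_pmf(k, mu, size) for k in range(int(y) + 1))


def marginal_mean(x, batch, beta0, beta_x, beta_batch):
    """Log-link mean: log(mu) = beta0 + beta_x * x + beta_batch[batch]."""
    return math.exp(beta0 + beta_x * x + beta_batch[batch])


# Hypothetical coefficients for one gene (assumed values, not estimates).
beta0, beta_x = 0.5, 1.2
beta_batch = {"b1": 0.0, "b2": 0.3}
size = 2.0

# The same observed count, y = 3, at an early and a late cell state.
mu_early = marginal_mean(0.1, "b1", beta0, beta_x, beta_batch)
mu_late = marginal_mean(0.9, "b1", beta0, beta_x, beta_batch)
u_early = nb_cdf(3, mu_early, size)
u_late = nb_cdf(3, mu_late, size)
print(mu_early, mu_late, u_early, u_late)
```

Because the mean increases with `x` in this toy setting, the same count of 3 sits lower in the distribution for the late cell than for the early one. These covariate-adjusted CDF values are what the copula in the next section takes as input.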

## Joint Distributions: Copula

The joint cumulative distribution function (CDF) $F(\cdot\mid \bfx_i,\bfz_i)$ of cell $i$ is modeled from the marginal CDFs $F_j(\cdot\mid \bfx_i,\bfz_i)$, $j=1,\ldots,m$, using a copula $C(\cdot\mid \bfx_i,\bfz_i): [0,1]^m\rightarrow [0, 1]$:

\[F(\bfy_i\mid \bfx_i,\bfz_i) = C(F_1(y_{i1}\mid \bfx_i,\bfz_i),\ldots, F_m(y_{im}\mid \bfx_i,\bfz_i)\mid \bfx_i,\bfz_i)\]

The copula can be

- Gaussian copula

- vine copula: decomposes a high-dimensional copula into a sequence of low-dimensional copulas
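The Gaussian copula case can be sketched for two features as follows: draw correlated standard normals, map them to uniforms with the normal CDF, then push each uniform through the inverse of its marginal CDF. This is a minimal sketch under assumptions, not scDesign3's code. The Poisson marginals, the fixed correlation `rho`, and all function names are chosen for illustration, and covariates are omitted.

```python
import math
import random
from statistics import NormalDist


def poisson_quantile(u, lam, max_k=10_000):
    """Smallest k with P(Y <= k) >= u for Y ~ Poisson(lam), capped at max_k."""
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k


def sample_gaussian_copula(n, rho, lams, seed=0):
    """Draw n pairs of counts with Poisson marginals and Gaussian-copula dependence."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    samples = []
    for _ in range(n):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        # Cholesky factor of [[1, rho], [rho, 1]] applied to independent normals.
        z1 = e1
        z2 = rho * e1 + math.sqrt(1 - rho ** 2) * e2
        # Normal CDF maps each coordinate to Uniform(0, 1), keeping the dependence.
        u1, u2 = std_normal.cdf(z1), std_normal.cdf(z2)
        # Inverse marginal CDFs turn the uniforms into counts.
        samples.append((poisson_quantile(u1, lams[0]), poisson_quantile(u2, lams[1])))
    return samples


def pearson(pairs):
    """Pearson correlation between the two coordinates of a list of pairs."""
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


dependent = sample_gaussian_copula(2000, rho=0.8, lams=(5.0, 20.0))
independent = sample_gaussian_copula(2000, rho=0.0, lams=(5.0, 20.0))
corr_dependent = pearson(dependent)
corr_independent = pearson(independent)
print(corr_dependent, corr_independent)
```

The marginals are the same in both runs; only the copula parameter changes, which is what lets the marginal models and the feature-feature dependence be specified separately.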