Model-based Approach for Joint Analysis of Single-cell data
This post is based on Lin, Z., Zamanighomi, M., Daley, T., Ma, S., & Wong, W. H. (2020). Model-Based Approach to the Joint Analysis of Single-Cell Data on Chromatin Accessibility and Gene Expression. Statistical Science, 35(1), 2–13.
The paper proposes a model-based approach for the integrative analysis of single-cell chromatin accessibility and gene expression data.
Single-cell sequencing-based technologies have become the primary tool to profile genomic features for hundreds or even thousands of cells in parallel. The measurement of gene expression is an imperfect substitute for the quantification of protein abundance.
Single-cell sequencing technologies that capture other functional genomic features are also emerging, and the number of available datasets is growing. These include datasets from
- single-cell ChIP-seq
- single-cell methylation
- single-cell chromatin accessibility
The data structures (genomic features by samples/cells) are similar between single-cell genomic and bulk genomic data, but the distinct characteristics of single-cell genomic data pose challenges for data analysis and create opportunities for methodology development:
- abundance of zeros:
- can be true biologically
- can be false due to the failure to detect the biological signal (technical failure to detect the signal is commonly observed in single-cell data and is referred to as dropout for single-cell gene expression experiments)
- batch effect/confounding variation:
- the standard balanced experimental designs are not possible for certain experimental protocols
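To make the two kinds of zeros concrete, here is a minimal simulation sketch. It is not taken from the paper: the capture probability, the "true" count distribution, and the function names `simulate_observed_counts` and `classify_zeros` are all assumptions chosen for illustration. Each molecule is detected independently with a fixed probability, so some cells with a real signal still end up with an observed zero.

```python
import random

def simulate_observed_counts(true_counts, capture_prob, rng):
    """Thin each true count: every molecule is detected independently
    with probability capture_prob (a simplifying assumption)."""
    observed = []
    for count in true_counts:
        detected = sum(1 for _ in range(count) if rng.random() < capture_prob)
        observed.append(detected)
    return observed

def classify_zeros(true_counts, observed_counts):
    """Split observed zeros into biological zeros (no true signal)
    and technical zeros (true signal that was not detected)."""
    pairs = list(zip(true_counts, observed_counts))
    biological = sum(1 for t, o in pairs if o == 0 and t == 0)
    technical = sum(1 for t, o in pairs if o == 0 and t > 0)
    return biological, technical

rng = random.Random(0)
# Hypothetical "true" counts of one gene in 1000 cells:
# roughly half of the cells do not express the gene at all.
true_counts = [0 if rng.random() < 0.5 else rng.randint(1, 5) for _ in range(1000)]
observed_counts = simulate_observed_counts(true_counts, capture_prob=0.2, rng=rng)
biological, technical = classify_zeros(true_counts, observed_counts)
print(f"biological zeros: {biological}, technical zeros (dropout): {technical}")
```

In real data only `observed_counts` is available, so the two kinds of zeros cannot be separated by inspection. This is why a statistical model of the missing signal is useful.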
The characterization of cell types based on their genomic signatures is one of the key computational challenges in single-cell genomics, as the cell identity is unknown and needs to be inferred. Existing clustering methods fall into two groups:
- algorithm-based: build upon different similarity/distance metrics between the cells
- SNN-Cliq: constructs a shared nearest neighbor (SNN) graph based on a subset of genes and clusters cells by identifying and merging sub-graphs
- pcaReduce: integrates principal components analysis and hierarchical clustering
- RaceID: an iterative $k$-means clustering algorithm based on a similarity matrix of Pearson’s correlation coefficients
- SC3: an ensemble clustering algorithm that combines the clustering outcomes of several other methods
- CIDR: imputes the gene expression profiles and calculates the dissimilarity matrix based on the imputed data matrix
- probabilistic model-based: the benefit of model-based approaches is that the clustering uncertainty can be quantified for each single cell, facilitating rigorous statistical inference and biological interpretation
- DIMM-SC: builds upon a Dirichlet mixture model and is designed to cluster droplet-based single-cell transcriptomic data
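The point about quantifying clustering uncertainty can be illustrated with a small sketch. This is not DIMM-SC or the model in the paper: it assumes a toy mixture of independent Poisson counts with already-known cluster weights and rates, and the names `posterior_membership`, `cluster_weights`, and `cluster_rates` are hypothetical. For one cell, Bayes' rule gives a posterior probability for each cluster, rather than a single hard label.

```python
import math

def log_poisson_pmf(count, rate):
    """Log of the Poisson probability mass function (rate must be > 0)."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)

def posterior_membership(cell_counts, cluster_weights, cluster_rates):
    """Posterior probability of each cluster for one cell via Bayes' rule,
    assuming genes are independent Poisson counts within a cluster."""
    log_scores = []
    for weight, rates in zip(cluster_weights, cluster_rates):
        log_lik = sum(log_poisson_pmf(c, r) for c, r in zip(cell_counts, rates))
        log_scores.append(math.log(weight) + log_lik)
    # Subtract the maximum before exponentiating for numerical stability.
    max_score = max(log_scores)
    unnormalized = [math.exp(s - max_score) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical fitted parameters: two clusters, three genes.
cluster_weights = [0.5, 0.5]
cluster_rates = [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]

clear_cell = [6, 0, 1]      # looks strongly like cluster 0
ambiguous_cell = [2, 2, 1]  # equally compatible with both clusters
for name, cell in [("clear", clear_cell), ("ambiguous", ambiguous_cell)]:
    probs = posterior_membership(cell, cluster_weights, cluster_rates)
    print(name, [round(p, 3) for p in probs])
```

An algorithm-based method would assign both cells to a single cluster. The posterior probabilities show that only the first assignment is well supported by the data.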
The paper focuses on the joint analysis of single-cell gene expression and single-cell chromatin accessibility data.
- Eukaryotic genomes are packaged into chromatin, and the nature of this packaging plays a central role in gene regulation
- ATAC-seq maps transposase-accessible chromatin regions, and provides information for understanding this epigenetic structure of chromatin packaging and for understanding gene regulation
- Single-cell chromatin accessibility sequencing (scATAC-seq) maps chromatin accessibility at single-cell resolution and provides insight into the cell-to-cell variation of gene regulation
Given two data types obtained from the same cell population but from different cells, the goal is to cluster the cells and match the cell types across the two data types.
A related work is
Duren, Z., Chen, X., Zamanighomi, M., Zeng, W., Satpathy, A. T., Chang, H. Y., Wang, Y., & Wong, W. H. (2018). Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proceedings of the National Academy of Sciences, 115(30), 7723–7728.
Lin et al. (2020) propose a model-based approach to jointly cluster single-cell chromatin accessibility and single-cell gene expression data. The authors claim that it has the following features:
- the model does not rely on training data to connect the two data types
- the noisiness in single-cell experiments is taken into account by explicitly modeling the loss of biological signals
- how well the two data types are matched is adaptively inferred from the data
- the model allows for statistical inference of the cluster assignment
- inference is carried out with an efficient Markov chain Monte Carlo algorithm that incorporates collapsing and the introduction of auxiliary variables