Eleven Challenges in Single Cell Data Science
Single-cell RNA sequencing (scRNA-seq): transcriptome-wide gene expression measurement at single-cell resolution.
- distinguish cell type clusters
- the arrangement of populations of cells according to novel hierarchies
- the identification of cells transitioning between states
- much clearer view of the dynamics of tissue and organism development
- structures within cell populations that had so far been perceived as homogeneous
Single-cell DNA sequencing (scDNA-seq): highlight somatic clonal structures
- track the formation of cell lineages
- provide insight into evolutionary processes acting on somatic mutations.
opportunities of sc-seq: re-evaluate hypotheses about differences between predefined sample groups at the single-cell level, no matter if such sample groups are disease subtypes, treatment groups, or simply morphologically distinct cell types.
Sc-seq datasets comprising very large cell numbers are becoming available worldwide, constituting a data revolution for the field of single-cell analysis.
Challenges:
- limited amounts of material available per cell lead to high levels of uncertainty about observations
- when amplification is used to generate more material, technical noise is added to the resulting data
- any increase in resolution results in rapidly growing dimension in data matrices
- various challenges: by research goal, tissue analyzed, experimental setup, or just by whether DNA or RNA is sequenced
Challenges in single-cell transcriptomics
Challenge I: Handling sparsity in scRNA-seq
scRNA-seq measurements typically suffer from large fractions of observed zeros, where a given gene in a given cell has no unique molecular identifiers or reads mapping to it.
term dropout: denote observed zero values in scRNA-seq data
- methodological noise, where a gene is expressed but not detected by the sequencing technology
- biologically-true absence of expression
the degree of sparsity depends on
- the scRNA-seq platform used
- the sequencing depth
- the underlying expression level of the gene
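As a minimal sketch of how sparsity can be quantified, the snippet below computes the fraction of observed zeros per gene and overall. The count matrix is made up for illustration (rows are genes, columns are cells), and the helper name `zero_fraction` is an assumption, not something from the paper.

```python
# Toy UMI count matrix: rows are genes, columns are cells (made-up values).
counts = [
    [0, 3, 0, 1, 0],      # lowly expressed gene
    [12, 9, 15, 0, 11],   # highly expressed gene
    [0, 0, 0, 0, 2],      # rarely detected gene
]


def zero_fraction(values):
    """Fraction of entries equal to zero in a flat list of counts."""
    return sum(1 for v in values if v == 0) / len(values)


# Sparsity per gene, and over the whole matrix.
per_gene = [zero_fraction(row) for row in counts]
overall = zero_fraction([v for row in counts for v in row])
print(per_gene, overall)
```

In this toy matrix the highly expressed gene has the fewest zeros, in line with the point that sparsity depends on the underlying expression level. The numbers alone cannot say which zeros are technical and which are biological.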
focus on the linked problems of learning latent spaces and imputing expression values from scRNA-seq data
the imputation of missing values has been very successful for genotype data
- typically know which data are missing
- rich sources of external information are available (e.g. haplotype reference panels)
but the situation is different for scRNA-seq data
- we do not routinely have external reference information to apply
- we can never be sure which of the observed zeros represent “missing data” and which accurately represent a true absence of gene expression in the cell
two broad approaches can be applied to tackle this problem of sparsity
- use statistical models that inherently model the sparsity, sampling variation and noise modes of scRNA-seq data with an appropriate data generative model
- attempt to “impute” values for observed zeros
three broad (and often overlapping) categories of methods can be used to impute scRNA-seq data in the absence of an external reference, plus a fourth category that relies on one
- model-based imputation methods: use probabilistic models to identify which observed zeros represent technical rather than biological zeros, and impute only the technical zeros, leaving other observed expression levels untouched
- data-smoothing methods: define a “similarity” between cells and adjust expression values for each cell based on expression values in similar cells. They usually adjust all expression values, including technical zeros, biological zeros, and observed non-zero values
- data-reconstruction methods: aim to define a latent space representation of the cells. Often done through matrix factorization (e.g., PCA) or through machine learning approaches (e.g., variational autoencoders that exploit deep neural networks to capture non-linear relationships)
- imputation with an external dataset or reference, using it for transfer learning
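To make the data-smoothing category concrete, here is a minimal sketch that averages each cell with its nearest neighbour. The matrix values, the Euclidean distance, and the function name `knn_smooth` are illustrative assumptions; published smoothing methods define similarity in more elaborate ways.

```python
import math

# Toy expression matrix: rows are cells, columns are genes (made-up values).
cells = [
    [5.0, 0.0, 2.0],
    [4.0, 1.0, 0.0],
    [0.0, 8.0, 7.0],
    [1.0, 9.0, 6.0],
]


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_smooth(matrix, k=1):
    """Replace each cell by the mean of itself and its k nearest cells."""
    smoothed = []
    for i, cell in enumerate(matrix):
        # Rank all other cells by distance to the current cell.
        others = sorted(
            (j for j in range(len(matrix)) if j != i),
            key=lambda j: euclidean(cell, matrix[j]),
        )
        group = [cell] + [matrix[j] for j in others[:k]]
        # Average gene by gene over the cell and its neighbours.
        smoothed.append([sum(col) / len(group) for col in zip(*group)])
    return smoothed


smoothed = knn_smooth(cells, k=1)
print(smoothed)
```

The first cell changes from `[5.0, 0.0, 2.0]` to `[4.5, 0.5, 1.0]`: the zero is filled in, but the observed non-zero values are altered as well, which is the behaviour noted for data-smoothing methods above.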
open problems
major challenge: the circularity that imputation solely relies on information that is internal to the imputed dataset.
to avoid the circularity: identify reliable external sources of information that can inform the imputation process
- exploit external reference panels
- systematic integration of known biological network structures in the imputation process
- explore complementary types of data that can inform scRNA-seq imputation
Challenge II: Defining flexible statistical frameworks for discovering complex differential patterns in gene expression
scRNA-seq enables a high granularity of changes in expression to be unraveled.
- changes in expression patterns
- cell type-specific changes in cell state across samples
status
double use of data in the differential expression detection methods
- status
- clustering for cell type assignment to identify such groups
- differential testing between clusters
- open problem
- integrative approaches
- selective inference methods
(something like the wrong way to do cross validation)
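The double use of data can be illustrated with a small simulation. This is a sketch under simplifying assumptions: a single gene, Gaussian noise, and a median split standing in for clustering. The cells come from one homogeneous population, yet testing between groups that were defined from the same values gives a very large test statistic.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / (
        (va / len(a) + vb / len(b)) ** 0.5
    )


rng = random.Random(0)
# One gene measured in 200 cells from a single population (no true groups).
expression = [rng.gauss(0.0, 1.0) for _ in range(200)]

# "Clustering": split the cells at the median of the very gene tested next.
cut = statistics.median(expression)
high = [x for x in expression if x > cut]
low = [x for x in expression if x <= cut]
t_double_use = welch_t(high, low)

# Control: groups assigned at random, independently of the expression values.
shuffled = expression[:]
rng.shuffle(shuffled)
t_random = welch_t(shuffled[:100], shuffled[100:])

print(round(t_double_use, 1), round(t_random, 1))
```

The statistic from the data-derived split is far larger than the one from the random split, even though there is no real difference to find. The problem comes from reusing the data, not from the test itself.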
general patterns of differential groups
- status
- average expression between groups, but the single cell-specific methods do not uniformly outperform the state-of-the-art bulk methods
- changes in trajectory along pseudotime (the ordering induced by the trajectory)
- changes in distributions
- open problem
- some existing methods could be further improved by integrating with other approaches that account for confounding effects such as cell cycle and complex batch effects
- connect with other aspects of single-cell expression dynamics, such as cell type composition, RNA velocity, splicing, and allele specificity.
methods for cross-sample comparisons of gene expression
- status
- expression is aggregated over multiple cells within each sample
- mixed models
- open problem
- some approaches can be expanded to the higher dimensions and characteristic aspects of scRNA-seq data
- a large space to explore other general and flexible approaches, such as hierarchical models where information is borrowed across samples or exploring changes in full distributions
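The aggregation approach mentioned above is often called pseudo-bulk. A minimal sketch, assuming made-up sample names and counts and a hypothetical helper `pseudobulk`:

```python
# Each cell: (sample id, counts per gene); made-up values for two genes.
cells = [
    ("patient_A", [3, 0]),
    ("patient_A", [1, 2]),
    ("patient_B", [0, 5]),
    ("patient_B", [2, 1]),
    ("patient_B", [4, 0]),
]


def pseudobulk(cells):
    """Sum counts over all cells of each sample, gene by gene."""
    totals = {}
    for sample, counts in cells:
        running = totals.setdefault(sample, [0] * len(counts))
        for gene_index, count in enumerate(counts):
            running[gene_index] += count
    return totals


profiles = pseudobulk(cells)
print(profiles)
```

The resulting per-sample profiles can be passed to bulk differential expression methods. In practice the aggregation would usually be done separately for each cell type, and summing discards the within-sample distribution that the hierarchical or distribution-based approaches aim to use.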
Challenge III: Mapping single cells to a reference atlas
the lack of appropriate, available references meant that only reference-free approaches were possible; unsupervised clustering was the predominant option, but it requires manual annotation of clusters.
cell atlases, as reference systems that systematically capture cell types and states, either tissue specific or across different tissues, remedy this issue
- status
- some atlas-type references have been published, but those for human and others are still under way
- the availability of these reference atlases has led to the active development of supervised classification methods
- open problems
- cell atlases are still under active development
- further benchmarking of methods that map cells of unknown type or state onto reference atlases
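As a rough sketch of what supervised mapping onto a reference can look like, the snippet below assigns a query cell to the reference cell type whose mean profile it correlates with best, and leaves it unassigned below a cutoff. The reference profiles, the cutoff `min_corr`, and the function names are hypothetical; real methods are considerably more elaborate.

```python
# Hypothetical reference atlas: mean expression profile per annotated cell type.
reference = {
    "T cell": [9.0, 1.0, 0.5, 7.0],
    "B cell": [1.0, 8.0, 6.0, 0.5],
}


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def assign_cell_type(profile, reference, min_corr=0.5):
    """Return (label, best correlation); label is 'unassigned' below min_corr."""
    scores = {name: pearson(profile, centroid) for name, centroid in reference.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] >= min_corr else "unassigned"
    return label, scores[best]


print(assign_cell_type([8.0, 0.0, 1.0, 6.0], reference))
print(assign_cell_type([1.0, 0.0, 5.0, 0.0], reference))
```

The explicit "unassigned" outcome matters because a query dataset may contain cell types or states that the reference does not cover.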
Challenge IV: Generalizing trajectory inference
status: start from a count matrix where genes are rows and cells are columns, then use clustering and other methods (probabilistic descriptions, Gaussian process latent variable models) to infer pseudotime and/or branching trajectories
open problems:
- define the features
- limited number of cells
- limited input material for each cell
challenges:
- assess how the various trajectory inference methods perform on different data types
- comparison or alignment, comparing different trajectories obtained from the same data type but across individuals or conditions
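For the comparison or alignment of trajectories, one generic technique for aligning two ordered sequences is dynamic time warping (DTW). It is used here as an assumption for illustration, not as the method the notes prescribe. The sketch compares the expression of one gene along pseudotime in made-up conditions.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning the first i points of a with the first j of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Expression of one gene along pseudotime (made-up values).
control = [0.0, 1.0, 2.0, 3.0, 2.0]
treated = [0.0, 0.0, 1.0, 2.0, 3.0, 3.0, 2.0]  # same shape, slower progression
reversed_dynamics = [3.0, 2.0, 1.0, 0.0, 0.0]

print(dtw_distance(control, treated), dtw_distance(control, reversed_dynamics))
```

The first pair aligns at zero cost because DTW lets one trajectory progress more slowly than the other. The second pair cannot be aligned without cost, which points to a real difference in dynamics.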
Challenge V: Finding patterns in spatially resolved measurements
we can obtain transcript abundance measurements while retaining spatial coordinates of cells or even transcripts within a tissue
for determining cell types, or clustering cells into groups, no method currently uses spatial information directly.
open problems:
- consider gene or transcript expression and spatial coordinates of cells, and derive an assignment of cells to classes, functional groups, or cell types
- whether uncertainty in the measurements can be propagated to downstream analyses
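One classical way to combine expression values with spatial coordinates is a spatial autocorrelation statistic such as Moran's I. The sketch below is an illustration only: the coordinates, the binary neighbourhood defined by `radius`, and the function name are assumptions, and it scores a single gene rather than assigning cells to classes.

```python
import math


def morans_i(coords, values, radius):
    """Moran's I with binary weights: cells within `radius` are neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    numerator, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                numerator += dev[i] * dev[j]
                weight_sum += 1.0
    return (n / weight_sum) * numerator / sum(d * d for d in dev)


# Six cells on a line, one unit apart (made-up layout).
coords = [(float(x), 0.0) for x in range(6)]
clustered = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]    # expression forms two spatial domains
alternating = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # neighbours always differ

print(morans_i(coords, clustered, radius=1.1))
print(morans_i(coords, alternating, radius=1.1))
```

Positive values indicate that neighbouring cells have similar expression and negative values that they differ. Both toy genes have the same overall distribution of values, so the difference is only visible when the coordinates are taken into account.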
Challenges in single-cell genomics
Genetic mutation events can occur during disease progression, and the resulting tumor cell populations are highly heterogeneous. As tumor heterogeneity can predict patient survival and response to therapy, understanding its dynamics is crucial for improving diagnosis and therapeutic choices.
scDNA-seq requires amplification, but it introduces errors and biases
- PCR-based: require thermostable polymerases
- MDA-based: sensitive to the DNA input quality
Challenge VI: Dealing with errors and missing data in the identification of variation from scDNA-seq data
major disturbing factor in scDNA-seq data is the WGA process
- amplification errors
- amplification bias, which leads to imbalanced proportions or a complete lack of variant alleles:
- an imbalanced proportion of alleles
- allele dropout
- site dropout
status: some single cell-specific SNV callers exist
open problem:
- SNV callers for scDNA-seq data have already incorporated amplification error rates and allele dropout in their models; the challenge remains to extend this by directly modeling the amplification process with a statistical model
- the integration with deep bulk sequencing data, as well as with scRNA-seq data, remains unexplored
- identification of short insertions and deletions (indels)
- systematic comparison of tools beyond the respective software publications
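To show why allele dropout has to enter the model, here is a deliberately simplified likelihood sketch. The mixture form, the parameter values, and the function names are assumptions for illustration; they do not reproduce any particular SNV caller, and the single `error` rate lumps amplification and sequencing errors together.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def het_likelihood(alt_reads, depth, error=0.01, dropout=0.2):
    """P(alt_reads | heterozygous site) under a simple allele-dropout mixture."""
    both_alleles = binom_pmf(alt_reads, depth, 0.5)      # no dropout
    only_ref = binom_pmf(alt_reads, depth, error)        # variant allele dropped out
    only_alt = binom_pmf(alt_reads, depth, 1 - error)    # reference allele dropped out
    return (1 - dropout) * both_alleles + dropout / 2 * (only_ref + only_alt)


def hom_ref_likelihood(alt_reads, depth, error=0.01):
    """P(alt_reads | homozygous reference site): alt reads arise only from errors."""
    return binom_pmf(alt_reads, depth, error)


# A site covered by 20 reads, none of which supports the variant allele.
print(het_likelihood(0, 20, dropout=0.0))   # model that ignores dropout
print(het_likelihood(0, 20, dropout=0.2))   # model that allows dropout
print(hom_ref_likelihood(0, 20))
```

Without dropout, zero variant reads make a heterozygous genotype practically impossible. With dropout in the model it remains plausible, so the cell is not confidently called as lacking the mutation.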
Challenges in single-cell phylogenomics
Models of cancer evolution may range from a simple representation of the presence versus the absence of a particular mutational event, to elaborate models of the mechanisms and rates of distinct mutational events. Two main modeling approaches to the analysis of tumor evolution:
- phylogenetics: key challenges
- biologically realistic
- computationally tractable
- population genetics: branching processes
Challenge VII: Scaling phylogenetic models to many cells and many sites
computational tractability, induced by
- the increasing numbers of cells that are sequenced in cancer studies
- the increasing numbers of sites that can be queried per genome
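One general reason the number of cells affects tractability, added here as background rather than taken from the notes, is that the number of rooted binary tree topologies on n labelled leaves is (2n - 3)!!. A short sketch:

```python
def rooted_binary_trees(n_leaves):
    """Number of rooted binary tree topologies with labelled leaves: (2n - 3)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


for n in (3, 5, 10, 50):
    print(n, rooted_binary_trees(n))
```

Ten cells already allow more than 34 million topologies, so exhaustive search is out of reach long before the cell numbers of current studies, and phylogenetic methods have to rely on heuristics or sampling.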
Challenge VIII: Integrating multiple types of variation into phylogenetic models
co-occurrence of all variation types further complicates mathematical modeling, as these events are not independent.
open problems: incorporate all variation types into a holistic model of cancer evolution
- SNVs: anticipate improvements in input data quality
- indels: variant callers are anticipated but remain to be developed
- CNVs: determine correct mutational profiles and compute realistic transition probabilities
Challenge IX: Inferring population genetic parameters of tumor heterogeneity by model integration
Although many mathematical models of tumor evolution have been proposed, fundamental parameters characterizing the evolutionary processes remain elusive.
open problems:
- integrate the subclone genotypes with the spatial location of single cells obtained from other measurements
- detect positive or diversifying selection with greater resolution
- adapt models for the detection of epistatic interactions to single-cell data
Overarching challenges
Challenge X: Integration of single-cell data across samples, experiments, and types of measurement
- integrate datasets across samples in one experiment: a few approaches are available
- integrate datasets across experiments:
- the most promising strategy is mapping cells to reference datasets such as the Human Cell Atlas
- assemble cell type clusters from different experiments
- integrate across multiple measurement types from the same cell: bulk approaches that address the integration of data from different types of experiments have the potential to be adapted to single cell-specific noise characteristics
- open problem: dependencies among those measurement types
- integrate across multiple measurement types from separate cells: identify subpopulations that had so far remained invisible.
No matter which combinations of measurement types, integrating data across experiments and different measurement types will further compound the challenge of missing data.
Challenge XI: Validating and benchmarking analysis tools for single-cell measurements
algorithms and pipelines should be able to pass two quality control tests:
- produce the expected results
- be robust to high levels of sequencing noise and technological biases
real datasets for benchmarking are scarce, and development of reliable simulation tools requires design and implementation of models that capture the essence of underlying biological processes and technological details of single-cell technologies and high-throughput sequencing platforms
status: current simulators are usually used only as auxiliary subroutines inside particular projects and are not published as stand-alone tools, with a few exceptions.
open problems:
- no such simulation tool for scDNA-seq data
- for single-cell phylogenomics, realistic and comprehensive simulation tools are required.
- most of the simulators concentrate on modeling biologically meaningful data, while ignoring or simplifying models for sc-seq errors and artifacts
- the selection of comprehensive evaluation metrics
- ideally, such a benchmarking framework would remain dynamic beyond an initial publication, to allow ongoing comparison of methods as new approaches are proposed.
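As one example of an evaluation metric, a clustering result can be scored against known labels (for instance from simulated data) with the adjusted Rand index. The sketch below implements the standard formula; the label vectors are made up, and the choice of this metric is an illustration, not a recommendation from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """Adjusted Rand index between two labelings of the same cells.

    Undefined (division by zero) when both labelings put all cells in one group.
    """
    n = len(truth)
    # Pairs of cells that share a group in both labelings, in truth, in predicted.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = same_truth * same_pred / comb(n, 2)  # agreement expected by chance
    max_index = (same_truth + same_pred) / 2
    return (same_both - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
relabelled = ["b", "b", "b", "a", "a", "a"]  # same grouping, different names
noisy = [0, 0, 1, 1, 1, 1]                   # one cell placed in the wrong group

print(adjusted_rand_index(truth, relabelled))
print(adjusted_rand_index(truth, noisy))
```

The index does not depend on how clusters are named and is corrected for chance agreement. A single metric still captures only one aspect of performance, which is why the choice of a comprehensive set of metrics is listed as an open problem.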