WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

DNA copy number profiling: from bulk tissue to single cells

Posted on
Tags: Copy Number Variation, Single-cell

This post is based on the talk given by Yuchao Jiang at the 11th ICSA International Conference on Dec. 20th, 2019.

Copy number variation (CNV)

CNV detection by next-generation DNA-seq

Bulk DNA-seq:

  • Whole-genome sequencing (WGS)
  • Whole-exome sequencing (WES) & targeted sequencing

Single-cell DNA-seq:

  • Conventional whole-genome amplification
  • 10x Genomics Chromium Single Cell CNV Solution

Goal

Use bulk/single-cell DNA sequencing to accurately detect CNV

  • Based on depth of coverage, i.e., number of times a genomic region is “read”
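To make "depth of coverage" concrete, here is a minimal Python sketch that counts how many reads start in each fixed-width bin of a region. The function name `bin_read_counts`, the 0-based read start positions, and the fixed bin size are my own assumptions for illustration; real pipelines compute depth from aligned reads, and none of the tools in this post necessarily work this way.

```python
def bin_read_counts(read_starts, region_length, bin_size):
    """Count reads per fixed-width bin (hypothetical helper for illustration).

    read_starts: iterable of 0-based start positions of reads in the region.
    Reads starting outside [0, region_length) are ignored.
    """
    n_bins = -(-region_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < region_length:
            counts[pos // bin_size] += 1
    return counts


# Five toy reads in a 100 bp region, binned at 10 bp.
toy_counts = bin_read_counts([0, 5, 10, 15, 99], region_length=100, bin_size=10)
```

Raw counts like these only become informative about copy number after the biases that distort depth have been normalized away.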

Biases:

  • GC content
  • Capture and amplification efficiency
  • Sequencing bias
  • Latent factors
  • Batch effect
  • Population stratification

Existing Works

Fromer et al. (2012) describe the difficulty as follows:

Because exome sequencing takes aim at a sparse (~1%) set of noncontiguous genomic targets (the exons), most CNV breakpoints will not be sequenced, leaving read depth as the predominant indicator of CNVs. However, the quantitative relationship between true copy number and depth is distorted by target- and sample-specific biases in exome hybridization (“capture”), PCR amplification, sequencing efficiency, and in silico read mapping, all of which are in turn affected by GC content of the targets, target size and sequence complexity, proximity to segmental duplications, nucleotide-level variation (SNPs), DNA concentration, hybridization temperature, experimental sample batching, and the complex interplay among these and various indeterminate factors.

They proposed a method called XHMM and described it as follows:

XHMM extracts copy-number signal from noisy read depth by leveraging the large-scale nature of sequencing projects to discern patterns of read-depth biases. Specifically, we ran a principal-component analysis (PCA) on the sample-by-target-depth matrix by “rotating” the high-dimensional data to find the main modes in which depth varies across multiple samples and targets, and we removed the largest of such effects.
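The idea of removing the largest modes of variation can be sketched with NumPy. This is not XHMM's actual implementation: the function name `remove_top_components`, the column centering, and the number of components to drop are all assumptions made for illustration.

```python
import numpy as np


def remove_top_components(depth, n_remove=1):
    """Remove the n_remove largest principal components from a
    sample-by-target depth matrix (hypothetical helper, not XHMM's code)."""
    depth = np.asarray(depth, dtype=float)
    # Center each target (column) so the SVD captures variation across samples.
    centered = depth - depth.mean(axis=0, keepdims=True)
    # Thin SVD: centered = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:n_remove] = 0.0  # zero out the largest modes of variation
    return (U * s_kept) @ Vt


# Toy data: depth is a pure sample effect times a target-specific bias,
# so after centering, all variation lies along a single component.
sample_effect = np.array([1.0, 2.0, 3.0, 4.0])
target_bias = np.array([10.0, 20.0, 15.0, 5.0, 30.0])
depth = np.outer(sample_effect, target_bias)
residual = remove_top_components(depth, n_remove=1)
```

In this toy example the residual is numerically zero, because the only structure in the data is the bias. With real data, the hope is that a rare CNV, which does not align with the removed components, largely survives in the residual.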

Here is a diagram illustrating some basic concepts of DNA, adapted from 潘高的小站 (Pan Gao's site): 外显子、内含子、mRNA、CDS、ORF的区别与联系 (“Differences and connections among exons, introns, mRNA, CDS, and ORF”).

GC content bias

GC content is calculated as

\[\frac{G+C}{A+G+C+T}\times 100\%\,.\]
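As a small worked example of this formula, the following Python sketch computes the GC content of a sequence string. The function name `gc_content` and the choice to ignore ambiguous bases such as `N` are my own assumptions.

```python
def gc_content(sequence: str) -> float:
    """Return GC content in percent, counting only A, C, G, and T bases.

    Hypothetical helper: ambiguous bases (e.g. N) are skipped, which is
    one possible convention among several.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / total


gc_example = gc_content("ATGC")  # two of the four bases are G or C
```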

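The theoretical distribution of GC content across reads would be a normal distribution (refer to 小L生信学习日记-4丨原始数据质量如何判断?-下, “Little L's bioinformatics study diary 4: how to judge raw data quality? (part 2)”).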

Poisson latent factor model for normalization

CODEX: COpy number Detection by EXome-seq

source: https://github.com/yuchaojiang/CODEX

paper: CODEX: a normalization and copy number variation detection method for whole exome sequencing

Lower sensitivity for common CNVs

The speaker mentioned that the common CNV signals are attenuated by the Poisson latent factors, and he proposed the following solution.

CODEX2: full-spectrum CNV detection by NGS

source: https://github.com/yuchaojiang/CODEX2

Profiling somatic copy number aberrations by scDNA-seq

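For the difference between somatic and germline mutations, refer to somatic和germline突变有什么区别,问什么在研究癌症的过程中要区分这两种突变? (“What is the difference between somatic and germline mutations, and why should the two be distinguished in cancer research?”).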

scDNA-seq data from breast cancer patients

How to identify normal cells?

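The Gini coefficient is calculated as

\[G = \frac{A}{A+B} = 2A = 1-2B\,,\]

where \(A\) is the area between the line of equality and the Lorenz curve and \(B\) is the area under the Lorenz curve, so that \(A + B = 1/2\).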

A short but clear introduction to the Gini coefficient: Gini coefficient and Lorenz curve explained
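Writing B for the area under the Lorenz curve, G = 1 - 2B can be computed directly from the per-bin read counts of a cell. The sketch below is my own illustration, not code from the talk: the function name `gini` and the trapezoid-rule Lorenz curve are assumptions. My reading of the talk is that a cell with nearly uniform coverage (small G) is a candidate normal cell, but the exact criterion is not recorded in these notes.

```python
def gini(counts):
    """Gini coefficient of non-negative counts via the Lorenz curve.

    Hypothetical helper: B is the area under the piecewise-linear Lorenz
    curve (trapezoid rule), and the result is G = 1 - 2B.
    """
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one bin and a positive total count")
    cumulative = 0.0
    area_b = 0.0
    for x in xs:
        previous = cumulative
        cumulative += x / total
        # Trapezoid between consecutive Lorenz points, each of width 1/n.
        area_b += (previous + cumulative) / (2 * n)
    return 1.0 - 2.0 * area_b


uniform_cell = gini([5, 5, 5, 5])   # perfectly even coverage
skewed_cell = gini([0, 0, 0, 10])   # all reads in a single bin
```

With perfectly even coverage the Lorenz curve coincides with the line of equality and G is 0; with all reads in one of n bins, this discrete version gives (n - 1)/n.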

SCOPE: Single-cell COPy number Estimation

10x Genomics single-cell CNV datasets

10x Genomics is a company!! Refer to 10x Genomics: 昨天今天明天 (“10x Genomics: yesterday, today, and tomorrow”) for more details.

References

Fromer, M., Moran, J. L., Chambert, K., Banks, E., Bergen, S. E., Ruderfer, D. M., Handsaker, R. E., McCarroll, S. A., O’Donovan, M. C., Owen, M. J., Kirov, G., Sullivan, P. F., Hultman, C. M., Sklar, P., & Purcell, S. M. (2012). Discovery and Statistical Genotyping of Copy-Number Variation from Whole-Exome Sequencing Depth. The American Journal of Human Genetics, 91(4), 597–607.


Published in categories ICSA-2019