ClusterDE: a post-clustering DE method
in typical scRNA-seq data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is used to identify the DE genes between the cell clusters.
the common procedure uses the same data twice, an issue known as “double dipping”: the same data is used to define both cell clusters and DE genes, leading to false-positive DE genes even when the clusters are spurious.
the paper proposed ClusterDE, a post-clustering DE test for controlling the FDR of identified DE genes regardless of clustering quality.
core idea: generate real-data-based synthetic null data with only one cluster, as a counterfactual in contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test
- $Y =[Y_{ij}]\in \mathbb{N}^{n\times m}$: a cell-by-gene UMI count matrix with $n$ cells as rows and $m$ genes as columns
- $Z_i\in \{0, 1\}$: cell $i$’s latent cell type; the $n$ cells are assumed to belong to two latent cell types and are partitioned into two clusters by a clustering algorithm
- for gene $j$, $\{(Y_{ij}\mid Z_i=0)\}_{i=1}^n$ share the same mean denoted by $\mu_{0j} = \bbE[Y_{ij}\mid Z_i = 0]$, and $\{(Y_{ij}\mid Z_i=1)\}_{i=1}^n$ share the same mean denoted by $\mu_{1j} = \bbE[Y_{ij}\mid Z_i=1]$
- $H_{0j}:\mu_{0j} = \mu_{1j}$
- since $Z_i$’s are unobserved, standard single-cell data analysis partitions cells into two clusters using a clustering algorithm $g$
prediction in clustering??
- $\hat Z_i = g(Y_i)\in \{0, 1\}$ denotes cell $i$’s cluster membership
- $\mu_{0j}^{DD} = \bbE[Y_{ij}\mid \hat Z_i=0]$ and $\mu_{1j}^{DD} = \bbE[Y_{ij}\mid \hat Z_i=1]$
- the post-clustering DE test is $H_{0j}^{DD}: \mu_{0j}^{DD} = \mu_{1j}^{DD}$
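To see why testing on $\hat Z_i$ instead of $Z_i$ matters, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than part of the paper: Gaussian values stand in for expression, the "clustering algorithm" $g$ is a median split on each cell's row sum, and the DE test is a large-sample z-test. All cells come from one population, so every $H_{0j}$ is true.

```python
import random
from statistics import NormalDist, mean, variance

def two_sample_p(a, b):
    """Two-sided p-value from a large-sample z-test on the difference in means."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def count_rejections(Y, labels, alpha=0.05):
    """Number of genes with p < alpha when cells are grouped by `labels`."""
    rejected = 0
    for j in range(len(Y[0])):
        g0 = [row[j] for row, z in zip(Y, labels) if z == 0]
        g1 = [row[j] for row, z in zip(Y, labels) if z == 1]
        if two_sample_p(g0, g1) < alpha:
            rejected += 1
    return rejected

rng = random.Random(0)
n, m = 400, 20
# One homogeneous population: no gene is truly DE.
Y = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]

# Toy clustering g (an assumption): split cells at the median of their row sums.
sums = [sum(row) for row in Y]
cutoff = sorted(sums)[n // 2]
z_hat = [int(s >= cutoff) for s in sums]

# Reference labels drawn independently of Y, so the data are used only once.
z_rand = [rng.randint(0, 1) for _ in range(n)]

dipped = count_rejections(Y, z_hat)
honest = count_rejections(Y, z_rand)
print(f"rejections with data-derived clusters: {dipped}/{m}")
print(f"rejections with independent labels:    {honest}/{m}")
```

With data-derived clusters most genes are typically rejected even though none is DE, while independent labels reject at roughly the nominal rate. This is the gap between $H_{0j}^{DD}$ and $H_{0j}$.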
ClusterDE Step 1: synthetic null generation
- the null model: MVNB specified by the Gaussian copula
- fitting the null model to real data
- sample from the fitted null model (synthetic null data generation)
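The sampling step can be sketched for two genes as follows. This is only an illustration under stated assumptions: the marginal means `mus`, dispersions `sizes`, and copula correlation `rho` are hypothetical values standing in for fitted parameters, and the fitting step itself is omitted. The helper names `nb_quantile` and `sample_null` are made up for this sketch.

```python
import math
import random
from statistics import NormalDist, mean

def nb_quantile(u, mu, size):
    """Smallest k whose negative binomial CDF reaches u (mean mu, dispersion size)."""
    p = size / (size + mu)
    k, cdf = 0, 0.0
    while True:
        log_pmf = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
                   + size * math.log(p) + k * math.log(1 - p))
        cdf += math.exp(log_pmf)
        if cdf >= u or k > 10_000:  # cap guards against u numerically equal to 1
            return k
        k += 1

def sample_null(n_cells, mus, sizes, rho, seed=0):
    """Draw two-gene counts: NB marginals joined by a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    cells = []
    for _ in range(n_cells):
        # Correlated standard normals (2x2 Cholesky factor written out by hand).
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then to NB counts through each gene's quantile function.
        u = (std_normal.cdf(z1), std_normal.cdf(z2))
        cells.append([nb_quantile(u[j], mus[j], sizes[j]) for j in range(2)])
    return cells

# Hypothetical "fitted" parameters; a single set for all cells means one cluster.
synthetic_null = sample_null(1000, mus=[5.0, 20.0], sizes=[2.0, 10.0], rho=0.6)
print("gene means:", mean(c[0] for c in synthetic_null), mean(c[1] for c in synthetic_null))
```

Because every cell is drawn from the same multivariate distribution, the synthetic data keep gene-gene correlation but contain no cell-type structure, which is what makes them usable as a null.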
ClusterDE Step 2: cell clustering
apply the clustering to the target data and the synthetic null data in parallel
ClusterDE Step 3: DE analysis
- $S_j := -\log_{10}P_j$: target DE score of gene $j$
- $\tilde S_j := -\log_{10}\tilde P_j$: null DE score of gene $j$
ClusterDE Step 4: FDR control
the contrast score $C_j$ of gene $j$ is defined as
\[C_j = S_j - \tilde S_j\]
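A small sketch of how the contrast scores could be computed and thresholded. The definition of $C_j$ follows the formula above; the thresholding rule is an assumption on my part (a knockoff/Clipper-style cutoff: the smallest $t$ with $(1 + \#\{C_j \le -t\}) / \max(\#\{C_j \ge t\}, 1) \le q$), since these notes stop at the definition. The p-values are made-up toy numbers.

```python
import math

def contrast_scores(p_target, p_null):
    """C_j = S_j - S~_j, where S = -log10(p) on target and synthetic null data."""
    return [-math.log10(p) + math.log10(q) for p, q in zip(p_target, p_null)]

def fdr_threshold(scores, q=0.05):
    """Smallest t among nonzero |C_j| with (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1) <= q."""
    for t in sorted({abs(c) for c in scores if c != 0}):
        neg = sum(c <= -t for c in scores)
        pos = sum(c >= t for c in scores)
        if (1 + neg) / max(pos, 1) <= q:
            return t
    return None  # no threshold meets the target FDR

# Toy p-values: 25 genes with strong signal on the target data only, 5 without.
p_target = [1e-6] * 25 + [0.3, 0.6, 0.25, 0.8, 0.5]
p_null = [0.4] * 25 + [0.5, 0.3, 0.4, 0.6, 0.5]

C = contrast_scores(p_target, p_null)
t_star = fdr_threshold(C, q=0.05)
selected = [j for j, c in enumerate(C) if t_star is not None and c >= t_star]
print("threshold:", t_star)
print("number of selected genes:", len(selected))
```

The intuition is that a true non-DE gene is about as likely to score higher on the null data as on the target data, so the count of large negative $C_j$ estimates the number of false discoveries among large positive $C_j$.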