ClusterDE: a post-clustering DE method

Posted on Dec 04, 2023

Tags: False Discovery Rate, Differential Expression, Knockoffs, Single-cell

This post is for Song, Dongyuan, Kexin Li, Xinzhou Ge, and Jingyi Jessica Li. “ClusterDE: A Post-Clustering Differential Expression (DE) Method Robust to False-Positive Inflation Caused by Double Dipping,” 2023

In a typical scRNA-seq data analysis, a clustering algorithm is first applied to find putative cell types as clusters, and a statistical differential expression (DE) test is then used to identify the DE genes between the cell clusters.

This common procedure is known as “double dipping”: the same data are used to define both the cell clusters and the DE genes, which leads to false-positive DE genes even when the clusters are spurious.

The paper proposes ClusterDE, a post-clustering DE test that controls the FDR of the identified DE genes regardless of clustering quality.

Core idea: generate synthetic null data that are based on the real data but contain only one cluster, as a counterfactual to the real data, and use them to evaluate the whole procedure of clustering followed by a DE test.

$Y =[Y_{ij}]\in \mathbb{N}^{n\times m}$: a cell-by-gene UMI count matrix with $n$ cells as rows and $m$ genes as columns
$Z_i\in \{0, 1\}$: cell $i$’s latent cell type; the $Z_i$’s partition the $n$ cells into two latent cell types
for gene $j$, $\{(Y_{ij}\mid Z_i=0)\}_{i=1}^n$ share the same mean, denoted by $\mu_{0j} = \bbE[Y_{ij}\mid Z_i = 0]$, and $\{(Y_{ij}\mid Z_i=1)\}_{i=1}^n$ share the same mean, denoted by $\mu_{1j} = \bbE[Y_{ij}\mid Z_i=1]$
$H_{0j}:\mu_{0j} = \mu_{1j}$
since $Z_i$’s are unobserved, standard single-cell data analysis partitions cells into two clusters using a clustering algorithm $g$

prediction in clustering??

let $\hat Z_i = g(Y_i)\in \{0, 1\}$ denote cell $i$’s cluster membership, where $Y_i$ is the $i$-th row of $Y$
$\mu_{0j}^{DD} = \bbE[Y_{ij}\mid \hat Z_i=0]$ and $\mu_{1j}^{DD} = \bbE[Y_{ij}\mid \hat Z_i=1]$
the post-clustering DE test therefore tests $H_{0j}^{DD}: \mu_{0j}^{DD} = \mu_{1j}^{DD}$ instead of $H_{0j}$
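
To see why the double-dipping null can fail even when every gene satisfies the original null, here is a minimal simulation sketch. It is not the paper's experiment: the data are i.i.d. Gaussian (a stand-in for normalized expression, not UMI counts), the "clustering" $g$ is a hypothetical median split on each cell's total expression, and the DE test is a large-sample two-sided z-test. All function names are made up for illustration.

```python
import math
import random
from statistics import NormalDist, mean, variance


def simulate_one_cluster(n_cells, n_genes, rng):
    """Null data: all cells come from one distribution (a single cell type)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]


def split_by_row_sum(y):
    """Assumed stand-in for a clustering algorithm g: label a cell 1 if its
    total expression is at or above the (upper) median total, else 0."""
    totals = [sum(row) for row in y]
    cutoff = sorted(totals)[len(totals) // 2]
    return [1 if t >= cutoff else 0 for t in totals]


def two_sample_p_value(a, b):
    """Two-sided large-sample z-test with a Welch-type standard error."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def rejection_rate(y, labels, alpha=0.05):
    """Fraction of genes called DE between the two label groups."""
    n_genes = len(y[0])
    rejected = 0
    for j in range(n_genes):
        g0 = [row[j] for row, z in zip(y, labels) if z == 0]
        g1 = [row[j] for row, z in zip(y, labels) if z == 1]
        if two_sample_p_value(g0, g1) < alpha:
            rejected += 1
    return rejected / n_genes


rng = random.Random(0)
y = simulate_one_cluster(n_cells=400, n_genes=10, rng=rng)

dd_labels = split_by_row_sum(y)                # labels derived from y itself
random_labels = [rng.randrange(2) for _ in y]  # labels independent of y

dd_rate = rejection_rate(y, dd_labels)
indep_rate = rejection_rate(y, random_labels)
print(f"double dipping: {dd_rate:.2f}, independent labels: {indep_rate:.2f}")
```

No gene is truly DE here, yet the labels obtained from the data make most genes look DE, while labels that ignore the data reject at roughly the nominal rate.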

ClusterDE Step 1: synthetic null generation

the null model: MVNB specified by the Gaussian copula
fitting the null model to real data
sample from the fitted null model (synthetic null data generation)
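
The sampling part of Step 1 can be sketched as follows. This is only an illustration of drawing from a multivariate negative binomial defined through a Gaussian copula, assuming the marginal means, dispersions, and the copula correlation matrix have already been fitted; the parameter values and the helper names (`cholesky`, `nb_quantile`, `sample_null`) are hypothetical and not taken from the paper or its software.

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a small positive-definite matrix."""
    m = len(a)
    low = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def nb_quantile(u, mu, size):
    """Smallest k with NB CDF(k) >= u, where the NB has mean mu and
    variance mu + mu^2 / size."""
    p = size / (size + mu)
    pmf = p ** size  # P(Y = 0)
    cdf = pmf
    k = 0
    while cdf < u and k < 100000:
        # Recursion: P(Y = k + 1) = P(Y = k) * (k + size) / (k + 1) * (1 - p)
        pmf *= (k + size) / (k + 1) * (1.0 - p)
        k += 1
        cdf += pmf
    return k


def sample_null(n_cells, mus, sizes, corr, rng):
    """Draw a cell-by-gene count matrix from a Gaussian-copula NB model."""
    low = cholesky(corr)
    m = len(mus)
    std_normal = NormalDist()
    data = []
    for _ in range(n_cells):
        e = [rng.gauss(0.0, 1.0) for _ in range(m)]
        # Correlated standard normals: z = L e
        z = [sum(low[j][k] * e[k] for k in range(j + 1)) for j in range(m)]
        # Map to uniforms, then to NB counts through the marginal quantiles.
        u = [min(std_normal.cdf(v), 1.0 - 1e-12) for v in z]
        data.append([nb_quantile(u[j], mus[j], sizes[j]) for j in range(m)])
    return data


rng = random.Random(1)
mus = [5.0, 2.0, 10.0]    # hypothetical fitted marginal means
sizes = [2.0, 1.0, 5.0]   # hypothetical fitted NB dispersion (size) parameters
corr = [[1.0, 0.6, 0.0],  # hypothetical fitted copula correlation matrix
        [0.6, 1.0, 0.3],
        [0.0, 0.3, 1.0]]
synthetic_null = sample_null(2000, mus, sizes, corr, rng)
print(synthetic_null[:3])
```

Every cell is drawn from the same joint distribution, so the sampled matrix contains a single cluster by construction while keeping gene-gene correlation through the copula.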

ClusterDE Step 2: cell clustering

apply the clustering to the target data and the synthetic null data in parallel
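
A minimal sketch of this step, assuming a toy deterministic 2-means as the clustering algorithm (a hypothetical stand-in; the notes do not fix a particular algorithm) and tiny made-up matrices in place of the target and synthetic null data. The point is only that the same function with the same settings is run on both datasets.

```python
def two_means(y, n_iter=20):
    """Tiny Lloyd-style 2-means; returns a 0/1 label per cell (row of y)."""
    totals = [sum(row) for row in y]
    # Deterministic start: the cells with the smallest and largest totals.
    centers = [list(y[totals.index(min(totals))]),
               list(y[totals.index(max(totals))])]
    labels = None
    for _ in range(n_iter):
        new_labels = []
        for row in y:
            d = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centers]
            new_labels.append(0 if d[0] <= d[1] else 1)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [row for row, z in zip(y, labels) if z == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up toy data: two separated groups vs. a single blob.
target_data = [[0, 0], [0, 1], [10, 10], [10, 11]]
null_data = [[1, 2], [2, 1], [1, 1], [2, 2]]

target_labels = two_means(target_data)  # clusters on the target data
null_labels = two_means(null_data)      # same algorithm on the null data
print(target_labels, null_labels)
```

The algorithm splits the single-blob data into two clusters as well, which is exactly the behavior the synthetic null is meant to expose.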

ClusterDE Step 3: DE analysis

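$S_j := -\log_{10}P_j$: target DE score of gene $j$, where $P_j$ is the DE test p-value of gene $j$ computed on the target data
$\tilde S_j := -\log_{10}\tilde P_j$: null DE score of gene $j$, where $\tilde P_j$ is the p-value computed on the synthetic null data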

ClusterDE Step 4: FDR control

the contrast score $C_j$ of gene $j$ is defined as

\[C_j = S_j - \tilde S_j\]
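
The scores above are easy to compute from two vectors of p-values. The notes stop at the definition of $C_j$, so the thresholding rule in the sketch below is an assumption: a knockoff-style cutoff that picks the smallest $t$ with $(1 + \#\{j: C_j \le -t\}) / \#\{j: C_j \ge t\} \le q$. The p-values and function names are made up for illustration.

```python
import math


def de_scores(p_values):
    """DE score -log10(p); p is floored to avoid log10(0)."""
    return [-math.log10(max(p, 1e-300)) for p in p_values]


def contrast_scores(target_p, null_p):
    """C_j = S_j - S~_j, the target DE score minus the null DE score."""
    return [s - s_null for s, s_null in zip(de_scores(target_p), de_scores(null_p))]


def knockoff_style_threshold(c, q):
    """Assumed rule (not taken from the notes): smallest t among the |C_j|
    with (1 + #{C_j <= -t}) / #{C_j >= t} <= q; inf if no such t exists."""
    for t in sorted({abs(x) for x in c if x != 0}):
        n_neg = sum(1 for x in c if x <= -t)
        n_pos = sum(1 for x in c if x >= t)
        if n_pos > 0 and (1 + n_neg) / n_pos <= q:
            return t
    return math.inf


# Made-up p-values for 8 genes on the target and synthetic null data.
target_p = [1e-8, 1e-6, 1e-5, 1e-7, 0.5, 0.2, 0.03, 0.6]
null_p = [0.3, 0.5, 0.1, 0.2, 0.4, 0.05, 0.5, 0.1]

c = contrast_scores(target_p, null_p)
t = knockoff_style_threshold(c, q=0.25)
selected = [j for j, x in enumerate(c) if x >= t]
print(selected)
```

Genes with a large positive $C_j$ look more significant on the target data than on the synthetic null, while the negative $C_j$'s serve as an estimate of how many such genes would arise by chance.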

Published in categories Note


WeiYa's Work Yard
