Generalized Estimating Equations for Differentially Expressed Genes in Spatial Transcriptomics

Posted on Oct 03, 2025

Tags: Generalized Estimating Equations, Differentially Expressed Genes, Spatial Transcriptomics

This note is for Wang, Y., Zang, C., Li, Z., Guo, C. C., Lai, D., & Wei, P. (2025). A comparative study of statistical methods for identifying differentially expressed genes in spatial transcriptomics (p. 2025.02.17.638726). bioRxiv.

seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis
the paper proposes a Generalized Score Test (GST) in the Generalized Estimating Equations (GEEs) framework as a robust solution for differential gene expression analysis in ST.

Introduction

a common task in ST data analysis involves identifying DE genes across pathological grades

this study considered two potential approaches: generalized linear mixed model (GLMM) and generalized estimating equations (GEE)

2.2 Generalized Linear Mixed Model (GLMM)

Let $Y_{ij}$ represent the gene expression count for gene at spatial location $i$.

$X_i$: binary dummy variable for pathology grade at location $i$
spatial coordinates $s_i = (s_{i1}, s_{i2})$ for spatial location $i$

the model is specified as

\[\log \mu_{ij} = X_i^T\beta + \varepsilon_{ij}\]

where

$\mu_{ij}$ is the expected count for gene $j$ at location $i$
$\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients, where $\beta_{j0}$ is the intercept and $\beta_{j1}$ is the effect of Grade B compared to Grade A
the random effect $\varepsilon_{ij}$ is assumed to follow a normal distribution, $\varepsilon_{ij}\sim N(0, V(\sigma_j^2, \kappa_j, \tau))$, representing the random effect for gene $j$ at the $i$-th spatial location

for a gievn gene $j$, the spatial covariance matrix $V(\sigma_j^2, k_j, \tau)$ is defined based on the distances between pairs of spatial locations

the $(i, i’)$-th element of $V(\sigma_j^2, k_j, \tau)$ is given by

\[V_{ii'}(\sigma_j^2, k_j, \tau) = \sigma_j^2R(\tau, k_j) = \sigma_j^2 \exp(-\tau_{ii'}/k_j)\]

$\tau_{ii’} = \Vert s_i - s_{i’}\Vert$ denotes the Euclidean distance between two spatial locations $i$ and $i’$
$k_j > 0$ is a parameter that determines the rate of decay in correlation with distance, with larger values of $k_j$ indicating stronger correlations and smaller semi-variances
the exponential spatial structure here is a specific case of the Matern correlation structure $R(\tau)$

test for DE genes across the two pathology grades, test

\[H_0: \beta_{j1} = 0\qquad H_a: \beta_{j1} \neq 0\]

2.3 Generalized Estimating Equations (GEE)

instead of explicitly modeling the spatial correlation structure using random effects, the GEE use a “working” correlation matrix to account for the spatial dependence between observations

the paper adopted the GEEs model with an independent working correlation structure by dividing the whose ST tissue into $m$ spatial clusters using $K$-means clustering.

the mean of $Y_{ij}$, denoted by $\mu_{ij}$, is linked to the covariates through a log link function, and the model is specified as

\[\log\mu_{ij} = X_i^T\beta\]

where $\beta = [\beta_{j0}, \beta_{j1}]$ is the vector of fixed effect coefficients.

The parameters $\beta$ are estimated by solving the GEE:

\[\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i) = 0\]

where

$D_i$ is the derivative of the mean response with respect to $\beta$ in the cluster $i$: $D_i = \frac{\partial \mu_i}{\partial \beta}$
$V_i$ is the variance-covariance matrix of responses in the cluster $i$, which is a function of the working correlation matrix $R_i$ ($V_i = C_i^{1/2}R_iC_i^{1/2}$)
$C_i$ is the diagonal matrix that includes the variances of the individual observations within the cluster $i$

the robust variance estimate for the estimated coefficients is calculated by the sandwich estimator

\[\widehat{Var}(\hat\beta) = \left(\sum_{i=1}^m D_i^TV_i^{-1}D_i \right)^{-1}\left(\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^TV_i^{-1}D_i\right)\left(\sum_{i=1}^m D_i^TV_i^{-1}D_i\right)^{-1}\]

the advantages of the GEE framework is that it produces consistent and asymptotically normal estimates of the parameters even if the working correlation structure is incorrectly specified

2.3.1 Robust Wald Test

the robust Wald test statistic $W$ is computed as

\[W = \frac{\hat\beta_{j1}}{\widehat{se}(\hat\beta_{j1})}\]

2.3.2 Generalized Score Test (GST)

The asymptomatic equivalence between the GST statistic and the robust Wald statistic holds only as the number of spatial clusters is large.

the score function $U(\hat\beta_0)$ is the derivative of the quasi-likelihood with respect to the parameter $\beta$:

\[U(\hat\beta_0) = \sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)\]

The GST statistic $S$ is computed as

\[S = \frac{U(\hat\beta_{j1})}{\widehat{se}(U(\hat\beta_{j1}))}\]

3. Results

3.1 Simulation Studies

fit the GLMM to a breast cancer ST dataset from 10x Genomics. Two key spatial parameters:
- $\sigma^2$: spatial variance that captures variability in gene expression across different spatial locations
- $\kappa$: spatial correlation that controls the rate of decay in correlation with distance between spatial locations
use the estimated parameters to generate simulated data

3.2 Real Data

3.2.1 Breast Cancer ST dataset

3798 spots and 24923 genes, retaining the 3000 most variable genes

3.2.2 Prostate Cancer ST dataset

4371 spots and 16907 genes, retaining the 3000 most variable genes

4. Discussion

Published in categories

← previous next →

See all posts →

WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.