Generalized Estimating Equations for Differentially Expressed Genes in Spatial Transcriptomics
Posted on
- seurat, by far the most popular tool for analyzing ST data, uses the Wilcoxon rank-sum test by default for differential expression analysis
- the paper proposes a Generalized Score Test (GST) in the Generalized Estimating Equations (GEEs) framework as a robust solution for differential gene expression analysis in ST.
Introduction
a common task in ST data analysis involves identifying DE genes across pathological grades
this study considered two potential approaches: generalized linear mixed model (GLMM) and generalized estimating equations (GEE)
2.2 Generalized Linear Mixed Model (GLMM)
Let $Y_{ij}$ represent the gene expression count for gene at spatial location $i$.
- $X_i$: binary dummy variable for pathology grade at location $i$
- spatial coordinates $s_i = (s_{i1}, s_{i2})$ for spatial location $i$
the model is specified as
\[\log \mu_{ij} = X_i^T\beta + \varepsilon_{ij}\]where
- $\mu_{ij}$ is the expected count for gene $j$ at location $i$
- $\beta = [\beta_{j0}, \beta_{j1}]^T$ is the vector of fixed effect coefficients, where $\beta_{j0}$ is the intercept and $\beta_{j1}$ is the effect of Grade B compared to Grade A
- the random effect $\varepsilon_{ij}$ is assumed to follow a normal distribution, $\varepsilon_{ij}\sim N(0, V(\sigma_j^2, \kappa_j, \tau))$, representing the random effect for gene $j$ at the $i$-th spatial location
for a gievn gene $j$, the spatial covariance matrix $V(\sigma_j^2, k_j, \tau)$ is defined based on the distances between pairs of spatial locations
the $(i, i’)$-th element of $V(\sigma_j^2, k_j, \tau)$ is given by
\[V_{ii'}(\sigma_j^2, k_j, \tau) = \sigma_j^2R(\tau, k_j) = \sigma_j^2 \exp(-\tau_{ii'}/k_j)\]- $\tau_{ii’} = \Vert s_i - s_{i’}\Vert$ denotes the Euclidean distance between two spatial locations $i$ and $i’$
- $k_j > 0$ is a parameter that determines the rate of decay in correlation with distance, with larger values of $k_j$ indicating stronger correlations and smaller semi-variances
- the exponential spatial structure here is a specific case of the Matern correlation structure $R(\tau)$
test for DE genes across the two pathology grades, test
\[H_0: \beta_{j1} = 0\qquad H_a: \beta_{j1} \neq 0\]2.3 Generalized Estimating Equations (GEE)
instead of explicitly modeling the spatial correlation structure using random effects, the GEE use a “working” correlation matrix to account for the spatial dependence between observations
the paper adopted the GEEs model with an independent working correlation structure by dividing the whose ST tissue into $m$ spatial clusters using $K$-means clustering.
the mean of $Y_{ij}$, denoted by $\mu_{ij}$, is linked to the covariates through a log link function, and the model is specified as
\[\log\mu_{ij} = X_i^T\beta\]where $\beta = [\beta_{j0}, \beta_{j1}]$ is the vector of fixed effect coefficients.
The parameters $\beta$ are estimated by solving the GEE:
\[\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i) = 0\]where
- $D_i$ is the derivative of the mean response with respect to $\beta$ in the cluster $i$: $D_i = \frac{\partial \mu_i}{\partial \beta}$
- $V_i$ is the variance-covariance matrix of responses in the cluster $i$, which is a function of the working correlation matrix $R_i$ ($V_i = C_i^{1/2}R_iC_i^{1/2}$)
- $C_i$ is the diagonal matrix that includes the variances of the individual observations within the cluster $i$
the robust variance estimate for the estimated coefficients is calculated by the sandwich estimator
\[\widehat{Var}(\hat\beta) = \left(\sum_{i=1}^m D_i^TV_i^{-1}D_i \right)^{-1}\left(\sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)(Y_i - \mu_i)^TV_i^{-1}D_i\right)\left(\sum_{i=1}^m D_i^TV_i^{-1}D_i\right)^{-1}\]the advantages of the GEE framework is that it produces consistent and asymptotically normal estimates of the parameters even if the working correlation structure is incorrectly specified
2.3.1 Robust Wald Test
the robust Wald test statistic $W$ is computed as
\[W = \frac{\hat\beta_{j1}}{\widehat{se}(\hat\beta_{j1})}\]2.3.2 Generalized Score Test (GST)
The asymptomatic equivalence between the GST statistic and the robust Wald statistic holds only as the number of spatial clusters is large.
the score function $U(\hat\beta_0)$ is the derivative of the quasi-likelihood with respect to the parameter $\beta$:
\[U(\hat\beta_0) = \sum_{i=1}^m D_i^TV_i^{-1}(Y_i - \mu_i)\]The GST statistic $S$ is computed as
\[S = \frac{U(\hat\beta_{j1})}{\widehat{se}(U(\hat\beta_{j1}))}\]3. Results
3.1 Simulation Studies
- fit the GLMM to a breast cancer ST dataset from 10x Genomics. Two key spatial parameters:
- $\sigma^2$: spatial variance that captures variability in gene expression across different spatial locations
- $\kappa$: spatial correlation that controls the rate of decay in correlation with distance between spatial locations
- use the estimated parameters to generate simulated data
3.2 Real Data
3.2.1 Breast Cancer ST dataset
3798 spots and 24923 genes, retaining the 3000 most variable genes
3.2.2 Prostate Cancer ST dataset
4371 spots and 16907 genes, retaining the 3000 most variable genes