Conformal Prediction for Single-cell Spatial Transcriptomics

Posted on Jun 07, 2024

This note is for Sun, E. D., Ma, R., Navarro Negredo, P., Brunet, A., & Zou, J. (2024). TISSUE: Uncertainty-calibrated prediction of single-cell spatial transcriptomics improves downstream analyses. Nature Methods, 21(3), 444–454.

whole-transcriptome spatial profiling

spatial gene expression prediction methods have been developed to infer the spatial expression of unmeasured transcripts, but the quality of these predictions can vary greatly

the paper presents Transcript Imputation with Spatial Single-cell uncertainty Estimation (TISSUE) as a general framework for

estimating uncertainty for spatial gene expression predictions
providing uncertainty-aware methods for downstream inference

Introduction

there exist several methods for imputing or predicting spatial gene expression using a paired single-cell dataset. Generally, these methods proceed by jointly embedding the spatial transcriptomics and RNA-seq datasets and then predicting the expression of new spatial genes, either by aggregating the nearest neighboring cells in the RNA-seq data or by joint probabilistic modeling, mapping, or transport.

SpaGE: joint embedding of spatial transcriptomics and RNA-seq data using PRECISE domain adaptation followed by kNN regression
Harmony: Harmony integration of the two data modalities and averaging of nearest cell expression profiles
Tangram: uses an optimal transport framework with deep learning to devise a mapping between single-cell and spatial transcriptomics data
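The nearest-neighbor aggregation step shared by the SpaGE- and Harmony-style approaches can be sketched as below. This is only a minimal illustration, assuming both datasets already live in a shared embedding; the function name `knn_impute` and the toy data are hypothetical and are not taken from any of these packages.

```python
import math

def knn_impute(spatial_emb, rna_emb, rna_expr, k=2):
    """Predict one gene for each spatial cell as the mean expression of
    its k nearest RNA-seq cells in the (assumed) shared embedding."""
    preds = []
    for s in spatial_emb:
        # Euclidean distance from this spatial cell to every RNA-seq cell
        dists = [math.dist(s, r) for r in rna_emb]
        nearest = sorted(range(len(rna_emb)), key=lambda i: dists[i])[:k]
        preds.append(sum(rna_expr[i] for i in nearest) / k)
    return preds

# Toy data: two spatial cells, four RNA-seq cells, one unmeasured gene
spatial_emb = [(0.0, 0.0), (5.0, 5.0)]
rna_emb = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (4.9, 5.0)]
rna_expr = [1.0, 3.0, 10.0, 12.0]

predicted = knn_impute(spatial_emb, rna_emb, rna_expr, k=2)
print(predicted)
```

Each spatial cell simply inherits the average expression of its closest RNA-seq neighbors, which is why the quality of the prediction depends heavily on how well the two modalities are aligned.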

TISSUE can be leveraged for improvements in various uncertainty-aware data analysis tasks including the calculation of prediction intervals, hypothesis testing, supervised learning (for example, cell-type classification and anatomic region classification), and clustering and visualization of spatial transcriptomics data

Traditionally, conformal inference proceeds by

fitting a machine learning model on labeled training data,
evaluating the model predictions on a small amount of labeled calibration data to build calibrated uncertainties,
then deploying the model on unlabeled test data to obtain both the predicted labels and their uncertainty
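These three steps correspond to standard split conformal prediction. A minimal sketch follows, assuming absolute residuals as the conformity score and a toy one-parameter regression as the "model"; every name here is illustrative and none of it is TISSUE-specific.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]

# Step 1: fit a model on labeled training data
# (toy model: least-squares slope through the origin)
train_x, train_y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
slope = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

def predict(x):
    return slope * x

# Step 2: score the model on held-out labeled calibration data
cal_x, cal_y = [1.0, 2.0, 3.0, 4.0], [2.5, 3.0, 6.5, 8.0]
scores = [abs(y - predict(x)) for x, y in zip(cal_x, cal_y)]
q = conformal_quantile(scores, alpha=0.2)

# Step 3: deploy on unlabeled test data, reporting prediction and interval
x_new = 5.0
interval = (predict(x_new) - q, predict(x_new) + q)
print(predict(x_new), interval)
```

Under exchangeability of calibration and test points, an interval built this way covers the true label with probability at least $1-\alpha$.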

TISSUE makes several key modifications to this procedure to build well-calibrated uncertainties:

establish an initial measure of prediction uncertainty that is scalable to unseen observations and agnostic to the prediction error
to calibrate these uncertainties to the prediction error, build distributions of calibration scores by linking these initial measures of uncertainty to the observed prediction errors on existing genes in the spatial transcriptomics data
these calibration score distributions were used for computing well-calibrated prediction intervals and improving downstream spatial transcriptomics data analysis

intuition:

large differences in predicted gene expression between neighboring cells of the same cell type would indicate low predictive performance
highly similar predicted gene expression between neighboring cells would signify high predictive performance for the spatial gene expression prediction method

introduce the cell-centric variability measure $U_{ij}$ which, for a given gene $j$, computes for each cell $i$ a weighted measure of deviation between the predicted expression of that cell and those of the cells within its spatial neighborhood $N_i$

\[U_{ij} = 1 + \sqrt{\frac{\sum_{k\in N_i} W_{ik}(\hat X_{kj}-\hat X_{ij})^2 }{\sum_{k\in N_i} W_{ik} }}\]
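As a sanity check on this definition, the sketch below evaluates $U_{ij}$ for a single gene on a toy neighborhood graph. It assumes the neighbor sets $N_i$ and weights $W_{ik}$ are already given (this note does not say how they are constructed); `cell_centric_variability` and the `weights` layout are hypothetical names chosen for illustration.

```python
import math

def cell_centric_variability(pred, weights):
    """U_i = 1 + sqrt( sum_k W_ik (pred_k - pred_i)^2 / sum_k W_ik )
    for one gene, where weights[i] maps each neighbor k of cell i to W_ik."""
    U = []
    for i, w_i in enumerate(weights):
        num = sum(w * (pred[k] - pred[i]) ** 2 for k, w in w_i.items())
        den = sum(w_i.values())
        U.append(1.0 + math.sqrt(num / den))
    return U

# Three cells on a line: cell 1 neighbors both ends, the ends only see cell 1
pred = [1.0, 1.0, 4.0]  # predicted expression of one gene
weights = [{1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}]  # assumed uniform weights

U = cell_centric_variability(pred, weights)
print(U)
```

Cell 0 agrees with its only neighbor, so its variability stays at the floor value of 1, while cells 1 and 2 sit next to a disagreeing prediction and receive larger values.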

adapt a conformal inference framework by computing the calibration score, which is defined as the ratio between the absolute prediction error and the cell-centric variability

\[s_{ij} = \frac{\vert X_{ij} - \hat X_{ij}\vert }{U_{ij}}\]
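A small numeric illustration of the score, using made-up values for the measured expression $X_{ij}$, the prediction $\hat X_{ij}$, and the variability $U_{ij}$ (the helper name `calibration_scores` is hypothetical):

```python
def calibration_scores(true_expr, pred_expr, U):
    """s = |X - X_hat| / U, computed elementwise for one gene."""
    return [abs(x - xh) / u for x, xh, u in zip(true_expr, pred_expr, U)]

true_expr = [2.0, 0.0, 5.0]  # measured expression (toy values)
pred_expr = [1.0, 1.0, 4.0]  # predicted expression
U = [1.0, 2.0, 4.0]          # cell-centric variability

scores = calibration_scores(true_expr, pred_expr, U)
print(scores)
```

All three cells have the same absolute error of 1, but the score shrinks as $U_{ij}$ grows: an error in a cell that was already flagged as variable counts for less than the same error in a cell that looked stable.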

Conformal prediction intervals

Retrieval of prediction intervals from calibration scores

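For a given level $\alpha$ (target confidence $1-\alpha$), construct the prediction interval with approximate coverage probability $1-\alpha$ by retrieving the $\frac{\lceil (n+1)(1-\alpha) \rceil}{n}$-th quantile of the upper interval calibration scores and of the lower interval calibration scores, denoted $q_\alpha^{(u)}$ and $q_\alpha^{(l)}$ respectively, where $n$ is the number of calibration scores in the corresponding set.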

the nonsymmetric conformal prediction interval for the predicted gene expression of cell $i$ and gene $j$ can be computed according to

\[I_{ij} = \left[ \hat X_{ij} - U_{ij} \cdot q_\alpha^{(l)}, \hat X_{ij} + U_{ij} \cdot q_\alpha^{(u)} \right]\]
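The interval construction can be sketched as follows. The split of calibration scores into a "lower" and an "upper" set is assumed here to follow the sign of the prediction error (truth below vs. above the prediction); that reading, the function names, and the toy scores are assumptions for illustration rather than details stated in this note.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[rank - 1] if rank <= n else math.inf

def prediction_interval(pred, U, lower_scores, upper_scores, alpha):
    """Asymmetric interval [pred - U * q_lower, pred + U * q_upper]."""
    q_lower = conformal_quantile(lower_scores, alpha)
    q_upper = conformal_quantile(upper_scores, alpha)
    return (pred - U * q_lower, pred + U * q_upper)

# Toy calibration scores (assumed split by sign of the prediction error)
lower_scores = [0.25, 0.5, 0.75, 1.0]
upper_scores = [0.5, 1.0, 1.5, 2.0]

interval = prediction_interval(pred=3.0, U=2.0,
                               lower_scores=lower_scores,
                               upper_scores=upper_scores,
                               alpha=0.2)
print(interval)
```

Because the two quantiles are estimated separately, the interval can extend further on one side of the prediction than the other, and its width scales with the cell-centric variability of the specific cell and gene.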

Uncertainty-aware hypothesis testing

Generating multiple imputations using calibration scores
modified two-sample t-test for multiple imputation

Published in categories Note
