Conformal Prediction for Single-cell Spatial Transcriptomics
whole-transcriptome spatial profiling
- spatial gene expression prediction methods have been developed to infer the spatial expression of unmeasured transcripts, but the quality of these predictions can vary greatly
the paper presents Transcript Imputation with Spatial Single-cell uncertainty Estimation (TISSUE) as a general framework for
- estimating uncertainty for spatial gene expression predictions
- providing uncertainty-aware methods for downstream inference
Introduction
Several methods exist for imputing or predicting spatial gene expression using a paired single-cell dataset. Generally, these methods first jointly embed the spatial transcriptomics and RNA-seq datasets and then predict the expression of new spatial genes, either by aggregating the nearest neighboring cells in the RNA-seq data or by joint probabilistic modeling, mapping, or transport.
- SpaGE: joint embedding of spatial transcriptomics and RNA-seq data using PRECISE domain adaptation followed by kNN regression
- Harmony: Harmony integration of the two data modalities and averaging of nearest cell expression profiles
- Tangram: uses an optimal transport framework with deep learning to devise a mapping between single-cell and spatial transcriptomics data
TISSUE can be leveraged for improvements in various uncertainty-aware data analysis tasks including the calculation of prediction intervals, hypothesis testing, supervised learning (for example, cell-type classification and anatomic region classification), and clustering and visualization of spatial transcriptomics data
Traditionally, conformal inference proceeds by
- fitting a machine learning model on labeled training data,
- evaluating the model predictions on a small amount of labeled calibration data to build calibrated uncertainties,
- then deploying the model on unlabeled test data to obtain both the predicted labels and their uncertainty
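The three steps above can be sketched as standard split conformal regression. This is a minimal illustration, not the TISSUE implementation: the least-squares model, the absolute-residual score, and the function names `conformal_quantile` and `split_conformal` are all assumptions made for the example.

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score.

    If that rank exceeds n, no finite threshold gives the requested
    coverage, so the function returns infinity.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

def split_conformal(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal intervals around a least-squares model (illustrative)."""
    def design(X):
        # Append an intercept column.
        return np.column_stack([X, np.ones(len(X))])

    # Step 1: fit the model on labeled training data.
    coef, *_ = np.linalg.lstsq(design(X_train), y_train, rcond=None)

    # Step 2: score the model on held-out calibration data.
    scores = np.abs(y_cal - design(X_cal) @ coef)
    q = conformal_quantile(scores, alpha)

    # Step 3: predict on test data and attach the calibrated half-width.
    y_hat = design(X_test) @ coef
    return y_hat, y_hat - q, y_hat + q
```

With exchangeable calibration and test points, intervals built this way cover the true label with probability at least $1-\alpha$ on average; the width here is the same for every test point, which is the limitation that a per-observation uncertainty measure is meant to address.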
TISSUE makes several key modifications to this procedure to build well-calibrated uncertainties:
- establish an initial measure of prediction uncertainty that is scalable to unseen observations and agnostic to the prediction error
- to calibrate these uncertainties to the prediction error, build distributions of calibration scores by linking these initial measures of uncertainty to the observed prediction errors on existing genes in the spatial transcriptomics data
- these calibration score distributions were used for computing well-calibrated prediction intervals and improving downstream spatial transcriptomics data analysis
intuition:
- large differences in predicted gene expression between neighboring cells of the same cell type would indicate low predictive performance
- highly similar predicted gene expressions between neighboring cells would signify high predictive performance for the spatial gene expression prediction method
introduce the cell-centric variability measure $U_{ij}$ which, for a given gene, computes for each cell a weighted measure of deviation between the predicted expression of that cell and those of the cells within a spatial neighborhood of it
\[U_{ij} = 1 + \sqrt{\frac{\sum_{k\in N_i} W_{ik}(\hat X_{kj}-\hat X_{ij})^2 }{\sum_{k\in N_i} W_{ik} }}\]
adapt a conformal inference framework by computing the calibration score, which is defined as the ratio between the absolute prediction error and the cell-centric variability
\[s_{ij} = \frac{\vert X_{ij} - \hat X_{ij}\vert }{U_{ij}}\]
Conformal prediction intervals
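As a starting point for the intervals, the cell-centric variability $U_{ij}$ and calibration score $s_{ij}$ defined above can be computed with a few lines of NumPy. This is a sketch under stated assumptions: each cell has the same number of neighbors, supplied as an index array `neighbors`, and the weights $W_{ik}$ are given as an array `weights`; how neighborhoods and weights are chosen is not specified here, and the function names are hypothetical.

```python
import numpy as np

def cell_centric_variability(X_hat, neighbors, weights):
    """U[i, j] = 1 + weighted RMS deviation of gene j between cell i and its neighbors.

    X_hat     : (n_cells, n_genes) predicted expression
    neighbors : (n_cells, k) integer indices of each cell's spatial neighbors (assumed fixed k)
    weights   : (n_cells, k) non-negative neighbor weights W_ik (assumed given)
    """
    # Deviation of each neighbor's prediction from the central cell's prediction.
    diff = X_hat[neighbors] - X_hat[:, None, :]          # (n_cells, k, n_genes)
    num = np.einsum("ik,ikj->ij", weights, diff ** 2)    # weighted sum of squares
    den = weights.sum(axis=1, keepdims=True)             # sum of weights per cell
    return 1.0 + np.sqrt(num / den)

def calibration_scores(X_true, X_hat, U):
    """s[i, j] = |X_ij - X_hat_ij| / U_ij for genes with measured expression."""
    return np.abs(X_true - X_hat) / U
```

Because of the added 1, $U_{ij} \ge 1$ everywhere, so the score never divides by zero, and a cell whose neighbors all share its predicted value gets a score equal to its absolute error.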
Retrieval of prediction intervals from calibration scores
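For a given miscoverage level $\alpha$ (that is, confidence level $1-\alpha$), construct the prediction interval with approximate probability coverage $1-\alpha$ by retrieving the $\frac{\lceil (n+1)(1-\alpha) \rceil}{n}$-th quantile of the upper interval calibration scores, $q_\alpha^{(u)}$, and of the lower interval calibration scores, $q_\alpha^{(l)}$, where $n$ is the number of calibration scores in the respective set.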
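The quantile retrieval can be illustrated as follows. The sketch assumes that the upper and lower calibration sets are formed by the sign of the residual (scores where the measured value exceeds the prediction go to the upper set, the rest to the lower set); that split rule, and the names `kth_score` and `asymmetric_interval`, are assumptions for illustration rather than details stated here. Each quantile then scales the cell-centric variability to give one side of the interval.

```python
import numpy as np

def kth_score(scores, alpha):
    """ceil((n+1)(1-alpha))-th smallest score, i.e. the ceil((n+1)(1-alpha))/n quantile."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few scores (including none) no finite bound is available.
    return np.inf if k > n else np.sort(scores)[k - 1]

def asymmetric_interval(x_hat, u, x_cal, x_hat_cal, u_cal, alpha=0.1):
    """Return (lower, upper) bounds for predictions x_hat with variability u.

    x_cal, x_hat_cal, u_cal are measured values, predictions, and variabilities
    for the calibration observations (1-D arrays of equal length).
    """
    resid = x_cal - x_hat_cal
    s = np.abs(resid) / u_cal
    q_hi = kth_score(s[resid > 0], alpha)    # assumed upper set: under-predictions
    q_lo = kth_score(s[resid <= 0], alpha)   # assumed lower set: over-predictions
    return x_hat - u * q_lo, x_hat + u * q_hi
```

For example, with upper-set scores 1, 2, 3, 4 and $\alpha = 0.5$, the rank is $\lceil 5 \cdot 0.5 \rceil = 3$, so the retrieved quantile is 3.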
the nonsymmetric conformal prediction interval for the predicted gene expression of cell $i$ and gene $j$ can be computed according to
\[I_{ij} = \left[ \hat X_{ij} - U_{ij} \cdot q_\alpha^{(l)}, \hat X_{ij} + U_{ij} \cdot q_\alpha^{(u)} \right]\]
Uncertainty-aware hypothesis testing
- Generating multiple imputations using calibration scores
- modified two-sample t-test for multiple imputation