Cross-prediction-powered inference

Posted on Oct 22, 2025

Tags: Black-box

This note is for Zrnic, T., & Candès, E. J. (2024). Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15), e2322083121.

cross-prediction: a method for valid inference powered by machine learning

cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model

Problem Setup

semisupervised setting

$n$ i.i.d. feature-label pairs, $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}\sim \bbP^n$
a dataset consisting of $N$ unlabeled data points, $\{\bar X_1,\ldots, \bar X_N\}\sim \bbP_X^N$

interested in the regime where $N \gg n$

the goal is to perform inference on a property $\theta^\star(\bbP)$

the proposal handles all estimands defined as a solution to an M-estimation problem

\[\theta^\star(\bbP) = \argmin_\theta L(\theta), \; \text{where }L(\theta) = \bbE[\ell_\theta(X, Y)]\]
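
two standard instances, written out as a quick check of the definition (the squared-loss forms below are textbook examples chosen here for illustration):

\[\ell_\theta(x, y) = \frac 12 (y - \theta)^2 \;\Longrightarrow\; \theta^\star(\bbP) = \bbE[Y]\]

\[\ell_\theta(x, y) = \frac 12 (y - x^\top\theta)^2 \;\Longrightarrow\; \theta^\star(\bbP) = \left(\bbE[XX^\top]\right)^{-1}\bbE[XY]\]

in both cases the solution follows from setting the gradient of $L(\theta)$ to zero; the second case assumes $\bbE[XX^\top]$ is invertible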

the classical estimator: ignore the unlabeled data and minimize the empirical loss on the labeled data alone

\[\hat\theta^{class} = \argmin_\theta L^{class}(\theta), \text{ where }L^{class}(\theta) = \frac 1n\sum_{i=1}^n\ell_\theta(X_i, Y_i)\]
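
as a concrete illustration, here is a minimal sketch of the classical estimator for two squared-loss estimands (the mean of $Y$ and least-squares coefficients). The function names and the simulated data are assumptions made for this example, not something taken from the paper.

```python
import numpy as np

def classical_mean(y):
    """Minimizer of (1/n) * sum_i 0.5 * (y_i - theta)^2, i.e. the sample mean."""
    return float(np.mean(y))

def classical_ols(X, y):
    """Minimizer of (1/n) * sum_i 0.5 * (y_i - x_i^T theta)^2 (least squares)."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

# Simulated labeled data only; the unlabeled features are never used here.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_theta = np.array([1.0, -2.0])  # hypothetical ground truth for the demo
y = X @ true_theta + rng.normal(scale=0.1, size=n)

theta_mean = classical_mean(y)
theta_ols = classical_ols(X, y)
```

both estimates depend only on the $n$ labeled pairs, so their precision is limited by $n$ no matter how large $N$ is.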

semisupervised inference
prediction-powered inference
- the core idea of correcting imputed predictions comes from prediction-powered inference
- however, prediction-powered inference assumes that, in addition to a labeled and an unlabeled dataset, the analyst is given a good pretrained machine learning model; cross-prediction makes no such assumption
theory of cross-validation
semiparametric inference
- estimation in the presence of a high-dimensional nuisance parameter
inference with missing data
- semisupervised inference can be seen as a special case of the problem of inference with missing data
- the proposed method bears similarities to multiple imputation
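
to make "correcting imputed predictions" without a pretrained model more concrete, below is a rough sketch for mean estimation with cross-fitting: each fold's model is trained on the other folds, the unlabeled points are imputed with the average of the fold models, and the bias is corrected with out-of-fold residuals on the labeled data. This reflects my reading of the idea, not the paper's implementation; the linear model standing in for the machine learning model, the choice of `K=5`, and all function names are assumptions. It returns a point estimate only and does not construct a confidence interval.

```python
import numpy as np

def fit_linear(X, y):
    # Stand-in for an arbitrary ML model: least squares with an intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def cross_prediction_mean(X, y, X_unlabeled, K=5, seed=0):
    """Point estimate of E[Y] from imputed predictions plus a residual correction."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    imputed = np.zeros(len(X_unlabeled))
    residual_sum = 0.0
    for held_out in folds:
        # Train on every labeled point outside the current fold.
        train = np.setdiff1d(np.arange(len(y)), held_out)
        f = fit_linear(X[train], y[train])
        # Average the K fold models' predictions on the unlabeled data.
        imputed += f(X_unlabeled) / K
        # Residuals are computed only on points the model did not see.
        residual_sum += np.sum(y[held_out] - f(X[held_out]))
    return float(imputed.mean() + residual_sum / len(y))

# Simulated example with N >> n; the true mean of Y is 2 by construction.
rng = np.random.default_rng(1)
n, N = 100, 10_000
X = rng.normal(size=(n, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=0.5, size=n)
X_unlabeled = rng.normal(size=(N, 1))

theta_cross = cross_prediction_mean(X, y, X_unlabeled)
```

because every residual comes from a model that never saw that labeled point, the correction term is not contaminated by overfitting, which is the reason for cross-fitting rather than training once on all labeled data.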