Cross-prediction-powered inference
This note is on Zrnic, T., & Candès, E. J. (2024). Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15), e2322083121.
cross-prediction: a method for valid inference powered by machine learning
cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model
Problem Setup
semisupervised setting
- $n$ i.i.d. feature-label pairs, $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}\sim \bbP^n$
- a dataset consisting of $N$ unlabeled data points, $\{\bar X_1,\ldots, \bar X_N\}\sim \bbP_X^N$
interested in the regime where $N \gg n$
the goal is to perform inference on a property $\theta^\star(\bbP)$
the proposal handles all estimands defined as a solution to an M-estimation problem
\[\theta^\star(\bbP) = \argmin_\theta L(\theta), \; \text{where }L(\theta) = \bbE[\ell_\theta(X, Y)]\]
the classical estimator: just dispense with the unlabeled data
\[\hat\theta^{class} = \argmin_\theta L^{class}(\theta), \text{ where }L^{class}(\theta) = \frac 1n\sum_{i=1}^n\ell_\theta(X_i, Y_i)\]
Related Work
- semisupervised inference
- prediction-powered inference
  - the core idea of cross-prediction, correcting imputed predictions, derives from prediction-powered inference
  - however, prediction-powered inference assumes that, in addition to the labeled and unlabeled datasets, the analyst is given a good pretrained machine learning model; cross-prediction makes no such assumption
- theory of cross-validation
- semiparametric inference
  - estimation in the presence of a high-dimensional nuisance parameter
- inference with missing data
  - semisupervised inference can be seen as a special case of the problem of inference with missing data
  - the proposed method bears similarities to multiple imputation
Cross-Prediction
basic idea: impute labels for the unlabeled data points, and then remove the bias arising from the inaccuracies in the predictions using the labeled data
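To make the impute-then-debias idea concrete, here is a minimal sketch for the simplest estimand, the mean $\theta^\star = \bbE[Y]$. It is an illustration under stated assumptions, not the paper's reference implementation: the function name `cross_prediction_mean`, the choice of $K$ folds, and the use of ordinary least squares as the predictor are all hypothetical choices made for this note. Each model is trained on all but one fold of the labeled data, used to impute labels for the unlabeled points, and its prediction error is measured on the held-out fold and subtracted off.

```python
import numpy as np

def cross_prediction_mean(X, Y, X_unlabeled, n_folds=5, seed=0):
    """Illustrative cross-prediction-style estimate of E[Y].

    Hypothetical helper for this note: the fold count and the linear
    predictor are assumptions, not prescribed by the source.
    """
    rng = np.random.default_rng(seed)
    n = len(Y)
    folds = np.array_split(rng.permutation(n), n_folds)

    def add_intercept(A):
        return np.column_stack([np.ones(len(A)), A])

    imputed = np.zeros(len(X_unlabeled))  # fold-averaged predictions
    bias_sum = 0.0                        # summed held-out prediction errors
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Assumed model: least squares fit on the training folds only.
        coef, *_ = np.linalg.lstsq(add_intercept(X[train]), Y[train], rcond=None)
        # Impute labels for the unlabeled data with this fold's model.
        imputed += add_intercept(X_unlabeled) @ coef / n_folds
        # Measure the model's error on labeled data it has not seen.
        bias_sum += np.sum(add_intercept(X[held_out]) @ coef - Y[held_out])
    # Mean of imputed labels minus the estimated bias of the predictions.
    return imputed.mean() - bias_sum / n

# Synthetic data with N >> n; the true mean of Y is 3.0.
rng = np.random.default_rng(1)
n, N = 200, 20000
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, 3))
Y = 3.0 + X @ beta + rng.normal(scale=0.5, size=n)
X_unlabeled = rng.normal(size=(N, 3))

theta_class = Y.mean()  # classical estimator: ignores the unlabeled data
theta_cross = cross_prediction_mean(X, Y, X_unlabeled)
print(theta_class, theta_cross)
```

Because every labeled point is held out exactly once, all $n$ labels contribute to the bias correction while no model is evaluated on data it was trained on. In this synthetic setup the features explain most of the variance of $Y$, so the cross-prediction estimate should typically land closer to the true mean than the classical one; with an uninformative predictor the gain would shrink.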