WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Cross-prediction-powered inference

Tags: Black-box

This note is on Zrnic, T., & Candès, E. J. (2024). Cross-prediction-powered inference. Proceedings of the National Academy of Sciences, 121(15), e2322083121.

cross-prediction: a method for valid inference powered by machine learning

cross-prediction is consistently more powerful than an adaptation of prediction-powered inference in which a fraction of the labeled data is split off and used to train the model

Problem Setup

semisupervised setting

  • $n$ i.i.d. feature-label pairs, $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}\sim \bbP^n$
  • a dataset consisting of $N$ unlabeled data points, $\{\bar X_1,\ldots, \bar X_N\}\sim \bbP_X^N$

interested in the regime where $N \gg n$

the goal is to perform inference on a property $\theta^\star(\bbP)$

the proposal handles all estimands defined as a solution to an M-estimation problem

\[\theta^\star(\bbP) = \argmin_\theta L(\theta), \; \text{where }L(\theta) = \bbE[\ell_\theta(X, Y)]\]
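
As a quick sanity check of this definition (a standard illustration, not taken from the note itself): with the squared loss $\ell_\theta(x, y) = \frac 12 (y - \theta)^2$ the estimand is the mean of $Y$, and with $\ell_\theta(x, y) = \frac 12 (y - x^\top\theta)^2$ it is the vector of population least-squares coefficients.

\[\ell_\theta(x, y) = \frac 12 (y - \theta)^2 \;\Longrightarrow\; L(\theta) = \frac 12\bbE[(Y - \theta)^2], \quad L'(\theta) = \theta - \bbE[Y] = 0 \;\Longrightarrow\; \theta^\star(\bbP) = \bbE[Y]\]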

the classical estimator: discard the unlabeled data and minimize the empirical loss over the $n$ labeled points only

\[\hat\theta^{class} = \argmin_\theta L^{class}(\theta), \text{ where }L^{class}(\theta) = \frac 1n\sum_{i=1}^n\ell_\theta(X_i, Y_i)\]
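
For the mean (squared loss), the classical estimator reduces to the sample mean of the labeled outcomes. A minimal sketch, assuming a normal-approximation interval; the function name `classical_mean_ci` and the simulated data are hypothetical and only for illustration:

```python
import numpy as np

def classical_mean_ci(y, z=1.96):
    """Sample mean of the labeled outcomes with a normal-approximation CI.

    z=1.96 corresponds to a nominal 95% interval (an assumption of this sketch).
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    theta_hat = y.mean()                  # argmin of the empirical squared loss
    se = y.std(ddof=1) / np.sqrt(n)       # standard error of the sample mean
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)

# Simulated labeled outcomes, for illustration only.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=200)
theta_hat, (lo, hi) = classical_mean_ci(y)
```

The width of this interval shrinks at rate $1/\sqrt{n}$ and does not depend on $N$, which is why ignoring the unlabeled data is wasteful when $N \gg n$.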

Related Work

  • semisupervised inference
  • prediction-powered inference
    • the core idea of correcting imputed predictions with labeled data comes from prediction-powered inference
    • however, prediction-powered inference assumes that, in addition to the labeled and unlabeled datasets, the analyst is given a good pretrained machine learning model; no such model is assumed to be available here
  • theory of cross-validation
  • semiparametric inference
    • estimation in the presence of a high-dimensional nuisance parameter
  • inference with missing data
    • semisupervised inference can be seen as a special case of the problem of inference with missing data
    • the proposed method bears similarities to multiple imputation

Cross-Prediction

basic idea: impute labels for the unlabeled data points, and then remove the bias arising from the inaccuracies in the predictions using the labeled data

cross-prediction for mean estimation
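
The note stops here, so the following is only a sketch of how the basic idea above can be carried out for the mean, based on my reading of Zrnic & Candès (2024) and worth checking against the paper. Split the labeled data into $K$ folds $I_1, \ldots, I_K$, train a model $f^{(j)}$ on all folds except $I_j$, average the $K$ models' predictions on the unlabeled points, and subtract the out-of-fold prediction bias measured on the labeled points:

\[\hat\theta^{+} = \frac 1N\sum_{i=1}^N \frac 1K\sum_{j=1}^K f^{(j)}(\bar X_i) - \frac 1n\sum_{j=1}^K\sum_{i\in I_j}\big(f^{(j)}(X_i) - Y_i\big)\]

The symbols $\hat\theta^{+}$, $I_j$, $f^{(j)}$ and the names `fit_linear` and `cross_prediction_mean` are notation introduced for this illustration. The linear model is a stand-in for any learner, and no variance estimate or confidence interval is attempted.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with intercept; returns a prediction function.

    Stand-in for an arbitrary machine learning model (assumption of this sketch).
    """
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xnew: np.column_stack([np.ones(len(Xnew)), Xnew]) @ coef

def cross_prediction_mean(X, y, X_unlabeled, K=5, fit=fit_linear, seed=0):
    """Point estimate of E[Y]: averaged imputations minus out-of-fold bias."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), K)   # random partition I_1..I_K
    avg_pred_unlabeled = np.zeros(len(X_unlabeled))
    residuals = np.empty(n)
    for idx in folds:
        train = np.setdiff1d(np.arange(n), idx)     # all labeled points outside fold
        f = fit(X[train], y[train])
        avg_pred_unlabeled += f(X_unlabeled) / K    # average of the K models
        residuals[idx] = f(X[idx]) - y[idx]         # held-out prediction error
    return avg_pred_unlabeled.mean() - residuals.mean()

# Simulated data with true mean E[Y] = 2.0, for illustration only.
rng = np.random.default_rng(1)
n, N = 200, 5000
X = rng.normal(size=(n, 3))
X_unlabeled = rng.normal(size=(N, 3))
y = 2.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
theta_cross = cross_prediction_mean(X, y, X_unlabeled)
```

One consistency check on the formula: if every $f^{(j)}$ predicts zero, the first term vanishes and the estimator collapses to the sample mean of the labels, i.e. the classical estimator.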

