WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Data Fission

Tags: Post-selection Inference

This note is for the discussion paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). http://arxiv.org/abs/2112.11079, discussed in the JASA invited session at JSM 2024.


  • Apr. 2018: Tian and Taylor use external randomization for Gaussian conditional selective inference
  • Nov. 2019: Data blurring/fission research begins
  • Feb. 2021: Rasines and Young apply Gaussian external randomization to non-Gaussian data with asymptotic error guarantees
  • Dec. 2021: Data blurring/fission preprint posted; the authors were initially unaware of Rasines and Young (2022)
  • Jan. 2023: Data thinning [Neufeld et al. (2024)] creates an improved decomposition rule for distributions falling into the “convolution-closed” class (e.g., Gaussian, Poisson, Binomial)
    • Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.
  • Mar. 2023: Generalized data thinning [Dharamshi et al. (2024)] unifies a variety of splitting, thinning, and fission procedures via concept of sufficiency
  • Jan. 2024: Graph fission [Leiner and Ramdas (2024)] explores fission procedures for inference and trend estimation on graph data sets


Generic selective inference problems

Main object of interest: $P_\theta(T(X)\mid S(X))$

  • $\theta$: unknown parameter; $S(X)$: selection; $T(X)$: inference target; both $S(X)$ and $T(X)$ can be randomized
  • Examples:
    • inference for coefficients after model selection
      • Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics, 44(3), 907–927.
      • Fithian, W., Sun, D., & Taylor, J. (2017). Optimal Inference After Model Selection (No. arXiv:1410.2597). arXiv. https://doi.org/10.48550/arXiv.1410.2597
      • Tian, X., & Taylor, J. (2017). Asymptotics of Selective Inference. Scandinavian Journal of Statistics, 44(2), 480–499. https://doi.org/10.1111/sjos.12261
    • hypothesis testing after clustering
    • data-driven ranking of hypotheses for multiple testing
      • Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5). https://doi.org/10.1214/15-AOS1337
      • Candes, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. arXiv:1610.02351 [Math, Stat]. http://arxiv.org/abs/1610.02351
      • Lei, L., & Fithian, W. (2018). AdaPT: An Interactive Procedure for Multiple Testing with Side Information. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4), 649–679. https://doi.org/10.1111/rssb.12274
      • Li, A., & Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1), 45–74. https://doi.org/10.1111/rssb.12298
      • Yurko, R., G’Sell, M., Roeder, K., & Devlin, B. (2020). A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proceedings of the National Academy of Sciences, 117(26), 15028–15035. https://doi.org/10.1073/pnas.1918862117
      • Lei, L., Ramdas, A., & Fithian, W. (2021). A general interactive framework for false discovery rate control under structural constraints. Biometrika, 108(2), 253–267. https://doi.org/10.1093/biomet/asaa064
      • Ignatiadis, N., & Huber, W. (2021). Covariate Powered Cross-Weighted Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4), 720–751. https://doi.org/10.1111/rssb.12411
      • Chao, P., & Fithian, W. (2021). AdaPT-GMM: Powerful and robust covariate-assisted multiple testing (No. arXiv:2106.15812). arXiv. https://doi.org/10.48550/arXiv.2106.15812
    • permutation/rank tests with learned test statistics
    • inference for adaptive experiments
    • narrowing down the null space for complex composite null
      • testing overlap / inference on extrema of the conditional mean
        • D’Amour, A., Ding, P., Feller, A., Lei, L., & Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.2019.10.014
        • Lei, L. (2023). Distribution-free inference on the extremum of conditional expectations via classification.
      • calibrating black-box ML algorithms
        • Angelopoulos, A. N., Bates, S., Candès, E. J., Jordan, M. I., & Lei, L. (2022). Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control (No. arXiv:2110.01052). arXiv. https://doi.org/10.48550/arXiv.2110.01052


When data splitting cannot answer the right question

  • Case 1: $S(X)$ is given
    • Hypothesis testing after clustering (same refs as above)
    • Inference for LASSO-selected coefficients
      • Panigrahi, S., & Taylor, J. (2023). Approximate Selective Inference via Maximum Likelihood. Journal of the American Statistical Association, 118(544), 2810–2820. https://doi.org/10.1080/01621459.2022.2081575
      • Panigrahi, S., MacDonald, P. W., & Kessler, D. (2023). Approximate Post-Selective Inference for Regression with the Group LASSO.
    • FDR estimation for a given selection procedure
    • Inference for adaptive experiments
  • Case 2: making a decision per data point
    • multiple testing with summary statistics
  • Case 3: heterogeneous data
    • projection parameters in fixed-X regressions
    • survey sampling/finite population causal inference
    • method evaluation for empirical Bayes procedures
    • dependent data (e.g., time series, network data)


Introduction


Suppose we only observe a single data point $X\sim N(0, 1)$ and want to split it into two parts such that each part contains some information about $X$:

  • $X$ can be reconstructed from both parts taken together
  • but neither part is sufficient by itself to reconstruct $X$
  • and yet the joint distribution of these two parts is known

this example has a simple solution with external randomization.

generate an independent $Z\sim N(0, 1)$, and set $f(X) = X+Z$ and $g(X) = X-Z$

one can then reconstruct $X$ by averaging: $X = (f(X) + g(X))/2$

seek to construct a family of pairs of functions $(f_\tau(X), g_\tau(X))_{\tau\in \cT}$, for some totally ordered set $\cT$, such that we can smoothly trade off the amount of information that each part contains about $X$

  • when $\tau$ approaches $\tau^+:=\sup\{\tau: \tau\in \cT\}$, $f(X)$ will approach independence from $X$, while $g(X)$ will essentially equal $X$
  • but when $\tau$ approaches $\tau^-:=\inf\{\tau:\tau \in\cT\}$, the opposite holds
\[f(X) = X-\tau Z, \; g(X) = X + \frac{Z}{\tau}, \cT:=(0, \infty)\]
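A quick simulation check of this family (an illustrative sketch, not code from the paper; many independent copies are drawn only to verify the distributional claims): for any fixed $\tau$, $f(X)$ and $g(X)$ are jointly Gaussian with zero covariance, hence independent, and $X$ is recovered exactly as $X = \frac{f(X) + \tau^2 g(X)}{1+\tau^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, tau = 200_000, 0.5          # tau is any value in (0, infinity)

X = rng.standard_normal(reps)     # X ~ N(0, 1)
Z = rng.standard_normal(reps)     # external randomization, Z ~ N(0, 1) independent of X
f = X - tau * Z
g = X + Z / tau

# zero correlation + joint Gaussianity => independence of f(X) and g(X)
print("corr(f, g):", np.corrcoef(f, g)[0, 1])
# exact reconstruction of X from the two parts
print("max |X - (f + tau^2 g)/(1 + tau^2)|:", np.abs(X - (f + tau**2 * g) / (1 + tau**2)).max())
# information trade-off: as tau grows, f carries less about X and g carries more
print("corr(f, X):", np.corrcoef(f, X)[0, 1], " corr(g, X):", np.corrcoef(g, X)[0, 1])
```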

data fission can achieve the same effect from just a single sample $X$ and not an $n$-dimensional vector

Question: can the above ideas be generalized to other distributions?

can one employ external randomization to split a single data point into two nontrivial parts when the distribution $P$ is not Gaussian?

a positive answer when $P$ is conjugate to some other distribution $Q$, where the latter will be used (along with $X$) to determine the external randomization

An application: data fission for post-selection inference


  1. Split $X$ into $f(X)$ and $g(X)$ such that $g(X)\mid f(X)$ is tractable to compute; the parameter $\tau$ controls the proportion of information to be used for model selection
  2. Use $f(X)$ to select a model and/or hypotheses to test
  3. Use $g(X)\mid f(X)$ to test hypotheses and/or perform inference


Tian and Taylor (2018) and Rasines and Young (2022) amount to letting $f(X) = X+\gamma Z$ and $g(X) = X$ with $Z\sim N(0, \sigma^2)$ for some fixed constant $\gamma > 0$

  • when $X\sim N(\mu, \sigma^2)$ with known $\sigma^2$, $g(X)\mid f(X)$ has a tractable finite sample distribution
  • in cases where $X$ is non-Gaussian, $g(X)\mid f(X)$ is only described asymptotically
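Concretely, for the Gaussian case this conditional law follows from bivariate normal conditioning (a direct calculation under the setup above, not quoted from the paper): with $X\sim N(\mu, \sigma^2)$ and independent $Z\sim N(0,\sigma^2)$,

\[g(X)\mid f(X) = X \mid (X+\gamma Z) \sim N\left(\mu + \frac{f(X)-\mu}{1+\gamma^2},\; \frac{\gamma^2\sigma^2}{1+\gamma^2}\right),\]

which depends on the unknown $\mu$, so it serves as a conditional likelihood for inference on $\mu$ after selection based on $f(X)$.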

the methodologies can be seen as a compromise between data splitting and the approach of data carving

data fission has advantages relative to data splitting in at least two distinct ways

  1. certain settings that involve sparse or rarely occurring features may result in a handful of data points having a disproportionate amount of influence.
  2. in settings where the best selected model is defined relative to a set of fixed covariates, the theoretical justification for data splitting becomes less clear conceptually.

the idea of adding and subtracting Gaussian noise has been employed for various goals in the literature

  • Tian and Taylor (2018): selective inference by introducing a noise variable
  • Xing et al. (2021): leave the response data unperturbed but create randomized versions of the covariates (termed “Gaussian mirrors”) by adding and subtracting Gaussian noise
  • Li and Fithian (2021): recast the knockoff procedure of Barber and Candes (2015) as a selective inference problem for the linear Gaussian model that adds noise to the OLS estimate ($\hat\beta$) to create a “whitened” version of $\hat\beta$ to use for hypothesis selection
  • Sarkar and Tang (2021): a similar way of using knockoffs to split $\hat\beta$ into independent pieces for hypothesis selection, but with a deterministic splitting procedure
  • Ignatiadis et al. (2021): for empirical Bayes procedures and not selective inference

Outline:

  • Section 2: general methodology for data fission
  • the use of data fission for four different applications
    • Section 3: selective CIs after interactive multiple testing
    • Section 4: fixed-design linear regression
    • Section 5: fixed-design generalized linear models
    • Section 6: trend filtering

Techniques to accomplish data fission

decompositions of $X$ into $f(X)$ and $g(X)$ such that both parts contain information about a parameter $\theta$ of interest, there exists some function $h$ such that $X = h(f(X), g(X))$, and either of the following two properties holds:

  • P1: $f(X)$ and $g(X)$ are independent with known distributions (up to the unknown $\theta$)
  • P2: $f(X)$ has a known marginal distribution and $g(X)$ has a known conditional distribution given $f(X)$ (up to knowledge of $\theta$)

Achieving P2 using “conjugate prior reversal”

suppose $X$ follows a distribution that is the conjugate prior for the parameter of some likelihood; then construct a new random variable $Z$ following that likelihood (with its parameter set to $X$).

Letting $f(X) = Z$ and $g(X) = X$, the conditional distribution of $g(X)\mid f(X)$ will have the same form as $X$ (with a different parameter depending on the value of $f(X)$)

  1. suppose $X \sim Exp(\theta)$, which is a conjugate prior for the Poisson distribution
  2. draw $Z = (Z_1,\ldots, Z_B)$ where each element is i.i.d. $Z_i \sim Poi(X)$ and $B\in\{1,2,\ldots\}$ is a tuning parameter
  3. let $f(X) = Z$ and $g(X) = X$, then the conditional distribution of $g(X)\mid f(X)$ is $Gamma(1+\sum_{i=1}^B f_i(X), \theta+B)$. On the other hand, $f(X)\sim Geo(\frac{\theta}{\theta+B})$
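A small Monte Carlo check of this example (an illustrative sketch; the values $\theta = 2$ and $B = 3$ are arbitrary): draws of $X$ with a fixed observed value of $\sum_i Z_i$ should match the stated Gamma conditional, and the sufficient statistic $\sum_i Z_i$ should match the stated geometric marginal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, B, reps = 2.0, 3, 200_000

X = rng.exponential(scale=1 / theta, size=reps)          # X ~ Exp(theta) (rate theta)
Z = rng.poisson(lam=np.repeat(X, B).reshape(reps, B))    # Z_i | X ~ Poi(X), i = 1, ..., B
s = Z.sum(axis=1)                                        # sufficient statistic of f(X) = Z

# g(X) | f(X): among draws with sum(Z) = 2, X should follow Gamma(1 + 2, rate theta + B)
sel = X[s == 2]
ks = stats.kstest(sel, stats.gamma(a=1 + 2, scale=1 / (theta + B)).cdf)
print("KS p-value vs Gamma(3, rate = theta + B):", ks.pvalue)

# f(X): sum(Z) should be Geometric(theta / (theta + B)) on {0, 1, 2, ...}
print("empirical P(sum Z = 0):", (s == 0).mean(), " theory:", theta / (theta + B))
```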

for exponential family distributions, $f(X)$ and $g(X)$ can be constructed analogously; the paper tabulates example decompositions for common distributions

Example decompositions

Relationship between data splitting and data fission

suppose $n$ i.i.d. observations $X = (X_1,\ldots, X_n)$, where $X_i\sim p(\theta)$. Under data splitting, a fraction $a\in(0,1)$ of the points is used for selection and the rest for inference, so the Fisher information splits as

\[I_X(\theta) = anI_1(\theta) + (1-a)n I_1(\theta)\]

on the other hand

\[I_X(\theta) = I_{f(X)}(\theta) + \bbE[I_{g(X)\mid f(X)}(\theta)]\]

For Gaussian data, let $\{X_i\}_{i=1}^n$ be i.i.d. $N(\theta, \sigma^2)$. Then $I_1 = \frac{1}{\sigma^2}$, so $\frac{an}{\sigma^2}$ is the amount of information used for selection under a data splitting rule. If the data are fissioned using $f(X) = X+\tau Z$, then $I_{f(X)} = \frac{n}{\sigma^2(1+\tau^2)}$. Note that the relation $a = \frac{1}{1+\tau^2}$ results in the same split of information.
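As a one-line check of the stated correspondence (assuming, as above, $Z_i \sim N(0, \sigma^2)$ independent of $X_i$, so that $f(X_i) = X_i + \tau Z_i \sim N(\theta, \sigma^2(1+\tau^2))$):

\[a n I_1(\theta) = \frac{1}{1+\tau^2}\cdot\frac{n}{\sigma^2} = \frac{n}{\sigma^2(1+\tau^2)} = I_{f(X)}(\theta) \quad\text{when}\quad a = \frac{1}{1+\tau^2}.\]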

Application: selective CIs after interactive multiple testing

$y_i\sim N(\mu_i, \sigma^2)$ for $n$ data points with known $\sigma^2$ alongside $x_i\in \cX$

goal: choose a subset of hypotheses $\cR$ to reject from the set $\{H_{0,i}: \mu_i = 0\}$ while controlling the FDR, which is defined as the expected value of the FDP

\[\mathrm{FDP} = \frac{\vert\{i\in\cR:\mu_i=0\}\vert}{\max\{\vert\cR\vert, 1\}}\]

after selecting these hypotheses, one wishes to construct either

  • multiple CIs with $1-\alpha$ coverage of $\mu_i$ for each $i\in \cR$
  • a single CI with $1-\alpha$ coverage of $\bar\mu = \frac{1}{\vert\cR\vert}\sum_{i\in \cR}\mu_i$
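A minimal sketch of this application (assumptions: Gaussian fission $f(y)=y+\tau z$, $g(y)=y-z/\tau$ with $z\sim N(0,\sigma^2)$, and plain BH at level $\alpha$ on the fissioned half in place of the paper's interactive procedure; all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, tau, alpha = 500, 1.0, 1.0, 0.1
mu = np.where(rng.random(n) < 0.1, 3.0, 0.0)          # sparse nonzero means
y = rng.normal(mu, sigma)

# fission: f(y) and g(y) are independent Gaussians (property P1)
z = rng.normal(0.0, sigma, n)
f_y, g_y = y + tau * z, y - z / tau
sd_f = sigma * np.sqrt(1 + tau**2)                    # sd of f(y_i)
sd_g = sigma * np.sqrt(1 + 1 / tau**2)                # sd of g(y_i)

# selection: BH on two-sided p-values computed from f(y) only
pvals = 2 * stats.norm.sf(np.abs(f_y) / sd_f)
order = np.argsort(pvals)
passed = np.flatnonzero(np.sort(pvals) <= alpha * np.arange(1, n + 1) / n)
R = order[: passed[-1] + 1] if passed.size else np.array([], dtype=int)

# inference: CIs for mu_i, i in R, from g(y), which is independent of the selection
zcrit = stats.norm.ppf(1 - alpha / 2)
ci = np.column_stack([g_y[R] - zcrit * sd_g, g_y[R] + zcrit * sd_g])
coverage = np.mean((ci[:, 0] <= mu[R]) & (mu[R] <= ci[:, 1])) if R.size else np.nan
print(f"rejections: {R.size}, coverage among selected: {coverage:.2f}")
```

Because the CIs use only $g(y)$, their validity does not depend on how $\cR$ was chosen from $f(y)$.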

Application: selective CIs in fixed-design linear regression

  • $y_i$: dependent variable
  • $x_i\in \IR^p$: non-random vector of $p$ features for $i = 1,\ldots, n$ samples
  • $X = (x_1,\ldots, x_n)^T$: model design matrix
  • $Y = (y_1,\ldots, y_n)^T$: response vector
\[Y = \mu + \epsilon\quad \text{with}\quad \epsilon = (\epsilon_1,\ldots, \epsilon_n)^T\sim N(0, \Sigma)\]

where

  • $\mu = \bbE[Y\mid X]\in \IR^n$ is a fixed unknown quantity
  • $\epsilon \in \IR^n$: random quantity with a known covariance matrix $\Sigma$

adding Gaussian noise $Z\sim N(0,\Sigma)$, letting $f(Y) = Y + \tau Z$ and $g(Y) = Y-\frac{1}{\tau}Z$

  1. use $f(Y)$ to select a model $M\subset [p]$ that defines a model design matrix $X_M$
  2. after selecting $M$, then use $g(Y)$ for inference by fitting a linear regression on $g(Y)$ against the selected covariates $X_M$

let $\hat\beta$ be defined in the usual way as

\[\hat\beta(M) = \argmin_{\tilde \beta} \Vert g(Y) - X_M\tilde\beta\Vert^2 = (X_M^TX_M)^{-1}X_M^Tg(Y)\]

the target parameter is the best linear approximator of the regression function using the selected model

\[\beta^\star(M) = \argmin_{\tilde\beta} \bbE\left[\Vert Y - X_M\tilde\beta\Vert^2 \right] = (X_M^TX_M)^{-1}X_M^T\mu\]

an assumption of this procedure is that the variance is known in order to do the initial split between $f(Y)$ and $g(Y)$ during the fission phase.

data fission will outperform data splitting in settings with small sample sizes and a handful of high-leverage points
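A minimal sketch of this pipeline (assumptions: $\Sigma = \sigma^2 I$ with known $\sigma$, and a lasso on $f(Y)$ standing in for the model-selection step; names are illustrative, not the paper's code):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma, tau, alpha = 100, 20, 1.0, 1.0, 0.1
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
mu = X @ beta_true                                    # fixed, unknown mean vector
Y = mu + rng.normal(0, sigma, n)

# fission with Sigma = sigma^2 I: f(Y) and g(Y) are independent
Z = rng.normal(0, sigma, n)
f_Y, g_Y = Y + tau * Z, Y - Z / tau

# step 1: select a model M using f(Y) only
M = np.flatnonzero(Lasso(alpha=0.1).fit(X, f_Y).coef_)
X_M = X[:, M]

# step 2: infer with g(Y); given f(Y), g(Y) ~ N(mu, sigma^2 (1 + 1/tau^2) I)
beta_hat = np.linalg.solve(X_M.T @ X_M, X_M.T @ g_Y)
beta_star = np.linalg.solve(X_M.T @ X_M, X_M.T @ mu)  # projection target (oracle, for checking)
cov = sigma**2 * (1 + 1 / tau**2) * np.linalg.inv(X_M.T @ X_M)
zcrit = stats.norm.ppf(1 - alpha / 2)
half = zcrit * np.sqrt(np.diag(cov))
print("selected columns:", M)
print("CI covers beta*:", (beta_hat - half <= beta_star) & (beta_star <= beta_hat + half))
```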

Application to selective CIs in fixed-design generalized linear models and other quasi-MLE problems

it is important to construct CIs that are robust to model misspecification for two reasons

  1. it is unlikely in practical settings that one is able to select a set of covariates that corresponds to the actual conditional mean of the data
  2. the post-selection distribution $g(Y)\mid f(Y)$ may be ill-behaved and difficult to work with even if $Y$ is easy to model

QMLE (quasi-maximum likelihood estimation): use MLE to train a model, but construct guarantees in an assumption-lean setting

Problem setup and a recap of QMLE methods

suppose $n$ independent random observations $y_i\in \IR$ alongside fixed covariates $x_i\in\IR^p$

Assumption 1: Data fission is conducted such that $g(y_i)\ind g(y_k) \mid f(Y), X$ for all $i\neq k$

Application to post-selection inference for trend filtering and other nonparametric regression problems

\[y_i = f_0(x_i) + \epsilon_i\]
  • $f_0$ is the underlying function to be estimated and $\epsilon_i$ is random noise
  • denote $Y = (y_1,\ldots, y_n)^T, \epsilon = (\epsilon_1,\ldots,\epsilon_n)^T$
  1. decompose $Y$ into $f(Y) = Y+\tau Z$ and $g(Y) = Y - \frac{1}{\tau} Z$ where $Z\sim N(0, \Sigma)$
  2. use $f(Y)$ to choose a basis $a_1,\ldots, a_p$ for the series expansion at the $x_i$. Let $A$ denote the matrix of basis vectors for all $n$ data points
  3. use $g(Y)\mid f(Y)$ to construct pointwise or uniform CIs

the fitted coefficients are

\[\hat\beta(A) = \argmin_\beta \Vert g(Y) - A\beta\Vert^2 = (A^TA)^{-1}A^Tg(Y)\]

meanwhile, define the projected regression function

\[\beta^\star(A) = \argmin_\beta \bbE\left[ \Vert Y - A\beta\Vert^2 \right]\]
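Assuming Gaussian noise $\epsilon\sim N(0,\Sigma)$ as in the linear-regression setup (so that $g(Y)$ is independent of $f(Y)$ and the basis $A$ is fixed once we condition on $f(Y)$), the distribution used for the CIs is

\[\hat\beta(A)\mid f(Y) \sim N\Big(\beta^\star(A),\; (1+\tau^{-2})(A^TA)^{-1}A^T\Sigma A(A^TA)^{-1}\Big),\]

so pointwise $1-\alpha$ CIs for the projected regression function at $x_i$ take the usual Wald form based on the $i$-th row of $A$.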

Published in categories JSM-2024