WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Data Fission

Tags: Post-selection Inference

This note is for the discussion paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). http://arxiv.org/abs/2112.11079, discussed in the JASA invited session at JSM 2024.


  • Apr. 2018: Tian and Taylor use external randomization for Gaussian conditional selective inference
  • Nov. 2019: Data blurring/fission research begins
  • Feb. 2021: Rasines and Young apply Gaussian external randomization to non-Gaussian data with asymptotic error guarantees
  • Dec. 2021: Data blurring/fission preprint posted; the authors were initially unaware of Rasines and Young (2022)
  • Jan. 2023: Data thinning [Neufeld et al. (2024)] creates an improved decomposition rule for distributions falling into the “convolution-closed” class (e.g., Gaussian, Poisson, Binomial)
    • Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.
  • Mar. 2023: Generalized data thinning [Dharamshi et al. (2024)] unifies a variety of splitting, thinning, and fission procedures via concept of sufficiency
  • Jan. 2024: Graph fission [Leiner and Ramdas (2024)] explores fission procedures for inference and trend estimation on graph data sets


Generic selective inference problems

Main object of interest: $P_\theta(T(X)\mid S(X))$

  • $\theta$: unknown parameter; $S(X)$: selection; $T(X)$: inference target; both $S(X)$ and $T(X)$ can be randomized
  • Examples:
    • inference for coefficients after model selection
      • Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics, 44(3), 907–927.
      • Fithian, W., Sun, D., & Taylor, J. (2017). Optimal Inference After Model Selection (No. arXiv:1410.2597). arXiv. https://doi.org/10.48550/arXiv.1410.2597
      • Tian, X., & Taylor, J. (2017). Asymptotics of Selective Inference. Scandinavian Journal of Statistics, 44(2), 480–499. https://doi.org/10.1111/sjos.12261
    • hypothesis testing after clustering
    • data-driven ranking of hypotheses for multiple testing
      • Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5). https://doi.org/10.1214/15-AOS1337
      • Candes, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. arXiv:1610.02351 [Math, Stat]. http://arxiv.org/abs/1610.02351
      • Lei, L., & Fithian, W. (2018). AdaPT: An Interactive Procedure for Multiple Testing with Side Information. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4), 649–679. https://doi.org/10.1111/rssb.12274
      • Li, A., & Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1), 45–74. https://doi.org/10.1111/rssb.12298
      • Yurko, R., G’Sell, M., Roeder, K., & Devlin, B. (2020). A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proceedings of the National Academy of Sciences, 117(26), 15028–15035. https://doi.org/10.1073/pnas.1918862117
      • Lei, L., Ramdas, A., & Fithian, W. (2021). A general interactive framework for false discovery rate control under structural constraints. Biometrika, 108(2), 253–267. https://doi.org/10.1093/biomet/asaa064
      • Ignatiadis, N., & Huber, W. (2021). Covariate Powered Cross-Weighted Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4), 720–751. https://doi.org/10.1111/rssb.12411
      • Chao, P., & Fithian, W. (2021). AdaPT-GMM: Powerful and robust covariate-assisted multiple testing (No. arXiv:2106.15812). arXiv. https://doi.org/10.48550/arXiv.2106.15812
    • permutation/rank tests with learned test statistics
    • inference for adaptive experiments
    • narrowing down the null space for complex composite null
      • testing overlap / inference on extrema of the conditional mean
        • D’Amour, A., Ding, P., Feller, A., Lei, L., & Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.2019.10.014
        • Lei, L. (2023). Distribution-free inference on the extremum of conditional expectations via classification.
      • calibrating black-box ML algorithms
        • Angelopoulos, A. N., Bates, S., Candès, E. J., Jordan, M. I., & Lei, L. (2022). Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control (No. arXiv:2110.01052). arXiv. https://doi.org/10.48550/arXiv.2110.01052


When data splitting cannot answer the right question

  • Case 1: $S(X)$ is given
    • Hypothesis testing after clustering (same refs as above)
    • Inference for LASSO-selected coefficients
      • Panigrahi, S., & Taylor, J. (2023). Approximate Selective Inference via Maximum Likelihood. Journal of the American Statistical Association, 118(544), 2810–2820. https://doi.org/10.1080/01621459.2022.2081575
      • Panigrahi, S., MacDonald, P. W., & Kessler, D. (2023). Approximate Post-Selective Inference for Regression with the Group LASSO.
    • FDR estimation for a given selection procedure
    • Inference for adaptive experiments
  • Case 2: making a decision per data point
    • multiple testing with summary statistics
  • Case 3: heterogeneous data
    • projection parameters in fixed-X regressions
    • survey sampling/finite population causal inference
    • method evaluation for empirical Bayes procedures
    • dependent data (e.g., time series, network data)


Introduction


Suppose we only observe a single data point $X\sim N(0, 1)$ and want to split it into two parts such that each part contains some information about $X$:

  • $X$ can be reconstructed from both parts taken together
  • but neither part is sufficient by itself to reconstruct $X$
  • and yet the joint distribution of these two parts is known

this example has a simple solution with external randomization.

generate an independent $Z\sim N(0, 1)$, and set $f(X) = X+Z$ and $g(X) = X-Z$

one can then reconstruct $X$ by averaging: $X = (f(X) + g(X))/2$

seek to construct a family of pairs of functions $(f_\tau(X), g_\tau(X))_{\tau\in \cT}$, for some totally ordered set $\cT$, such that we can smoothly trade off the amount of information that each part contains about $X$

  • when $\tau$ approaches $\tau^+:=\sup\{\tau: \tau\in \cT\}$, $f(X)$ will approach independence from $X$, while $g(X)$ will essentially equal $X$
  • but when $\tau$ approaches $\tau^-:=\inf\{\tau:\tau \in\cT\}$, the opposite holds
\[f(X) = X-\tau Z, \; g(X) = X + \frac{Z}{\tau}, \cT:=(0, \infty)\]
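A quick simulation check of this family (an illustrative sketch, not code from the paper; many independent copies are drawn only to verify the distributional claims): for any fixed $\tau$, $f(X)$ and $g(X)$ are jointly Gaussian with zero covariance, hence independent, and $X$ is recovered exactly as $X = \frac{f(X) + \tau^2 g(X)}{1+\tau^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, tau = 200_000, 0.5          # tau is any value in (0, infinity)

X = rng.standard_normal(reps)     # X ~ N(0, 1)
Z = rng.standard_normal(reps)     # external randomization, Z ~ N(0, 1) independent of X
f = X - tau * Z
g = X + Z / tau

# zero correlation + joint Gaussianity => independence of f(X) and g(X)
print("corr(f, g):", np.corrcoef(f, g)[0, 1])
# exact reconstruction of X from the two parts
print("max |X - (f + tau^2 g)/(1 + tau^2)|:", np.abs(X - (f + tau**2 * g) / (1 + tau**2)).max())
# information trade-off: as tau grows, f carries less about X and g carries more
print("corr(f, X):", np.corrcoef(f, X)[0, 1], " corr(g, X):", np.corrcoef(g, X)[0, 1])
```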

data fission can achieve the same effect from just a single sample $X$ and not an $n$-dimensional vector

Question: can the above ideas be generalized to other distributions?

can one employ external randomization to split a single data point into two nontrivial parts when the distribution $P$ is not Gaussian?

a positive answer when $P$ is conjugate to some other distribution $Q$, where the latter will be used (along with $X$) to determine the external randomization

An application: data fission for post-selection inference


  1. Split $X$ into $f(X)$ and $g(X)$ such that $g(X)\mid f(X)$ is tractable to compute; the parameter $\tau$ controls the proportion of information to be used for model selection
  2. Use $f(X)$ to select a model and/or hypotheses to test
  3. Use $g(X)\mid f(X)$ to test hypotheses and/or perform inference


Tian and Taylor (2018) and Rasines and Young (2022) amount to letting $f(X) = X+\gamma Z$ and $g(X) = X$ with $Z\sim N(0, \sigma^2)$ for some fixed constant $\gamma > 0$

  • when $X\sim N(\mu, \sigma^2)$ with known $\sigma^2$, $g(X)\mid f(X)$ has a tractable finite sample distribution
  • in cases where $X$ is non-Gaussian, $g(X)\mid f(X)$ is only described asymptotically
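Concretely, for the Gaussian case this conditional law follows from bivariate normal conditioning (a direct calculation under the setup above, not quoted from the paper): with $X\sim N(\mu, \sigma^2)$ and independent $Z\sim N(0,\sigma^2)$,

\[g(X)\mid f(X) = X \mid (X+\gamma Z) \sim N\left(\mu + \frac{f(X)-\mu}{1+\gamma^2},\; \frac{\gamma^2\sigma^2}{1+\gamma^2}\right),\]

which depends on the unknown $\mu$, so it serves as a conditional likelihood for inference on $\mu$ after selection based on $f(X)$.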

the methodologies can be seen as a compromise between data splitting and the approach of data carving

data fission has advantages relative to data splitting in at least two distinct ways

  1. certain settings that involve sparse or rarely occurring features may result in a handful of data points having a disproportionate amount of influence.
  2. in settings where the best selected model is defined relative to a set of fixed covariates, the theoretical justification for data splitting becomes less clear conceptually.

the idea of adding and subtracting Gaussian noise has been employed for various goals in the literature

  • Tian and Taylor (2018): selective inference by introducing a noise variable
  • Xing et al. (2021): leave the response data unperturbed but create randomized versions of the covariates (termed “Gaussian mirrors”) by adding and subtracting Gaussian noise
  • Li and Fithian (2021): recast the knockoff procedure of Barber and Candes (2015) as a selective inference problem for the linear Gaussian model that adds noise to the OLS estimate ($\hat\beta$) to create a “whitened” version of $\hat\beta$ to use for hypothesis selection
  • Sarkar and Tang (2021): a similar way of using knockoffs to split $\hat\beta$ into independent pieces for hypothesis selection, but with a deterministic splitting procedure
  • Ignatiadis et al. (2021): for empirical Bayes procedures and not selective inference

Outline:

  • Section 2: general methodology for data fission
  • the use of data fission for four different applications
    • Section 3: selective CIs after interactive multiple testing
    • Section 4: fixed-design linear regression
    • Section 5: fixed-design generalized linear models
    • Section 6: trend filtering

Techniques to accomplish data fission

decompositions of $X$ into $f(X)$ and $g(X)$ such that both parts contain information about a parameter $\theta$ of interest, there exists some function $h$ such that $X = h(f(X), g(X))$, and either of the following two properties holds:

  • P1: $f(X)$ and $g(X)$ are independent with known distributions (up to the unknown $\theta$)
  • P2: $f(X)$ has a known marginal distribution and $g(X)$ has a known conditional distribution given $f(X)$ (up to knowledge of $\theta$)

Achieving P2 using “conjugate prior reversal”

suppose $X$ follows a distribution that is the conjugate prior for the parameter of some likelihood; then construct a new random variable $Z$ following that likelihood (with its parameter set to $X$).

Letting $f(X) = Z$ and $g(X) = X$, the conditional distribution of $g(X)\mid f(X)$ will have the same form as $X$ (with a different parameter depending on the value of $f(X)$)

  1. suppose $X \sim Exp(\theta)$, which is a conjugate prior for the Poisson distribution
  2. draw $Z = (Z_1,\ldots, Z_B)$ where each element is i.i.d. $Z_i \sim Poi(X)$ and $B\in\{1,2,\ldots\}$ is a tuning parameter
  3. let $f(X) = Z$ and $g(X) = X$, then the conditional distribution of $g(X)\mid f(X)$ is $Gamma(1+\sum_{i=1}^B f_i(X), \theta+B)$. On the other hand, $f(X)\sim Geo(\frac{\theta}{\theta+B})$
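A small Monte Carlo check of this example (an illustrative sketch; the values $\theta = 2$ and $B = 3$ are arbitrary): draws of $X$ with a fixed observed value of $\sum_i Z_i$ should match the stated Gamma conditional, and the sufficient statistic $\sum_i Z_i$ should match the stated geometric marginal.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta, B, reps = 2.0, 3, 200_000

X = rng.exponential(scale=1 / theta, size=reps)          # X ~ Exp(theta) (rate theta)
Z = rng.poisson(lam=np.repeat(X, B).reshape(reps, B))    # Z_i | X ~ Poi(X), i = 1, ..., B
s = Z.sum(axis=1)                                        # sufficient statistic of f(X) = Z

# g(X) | f(X): among draws with sum(Z) = 2, X should follow Gamma(1 + 2, rate theta + B)
sel = X[s == 2]
ks = stats.kstest(sel, stats.gamma(a=1 + 2, scale=1 / (theta + B)).cdf)
print("KS p-value vs Gamma(3, rate = theta + B):", ks.pvalue)

# f(X): sum(Z) should be Geometric(theta / (theta + B)) on {0, 1, 2, ...}
print("empirical P(sum Z = 0):", (s == 0).mean(), " theory:", theta / (theta + B))
```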

for exponential family distributions, $f(X)$ and $g(X)$ can be constructed analogously; the paper tabulates example decompositions for common distributions

Example decompositions

Relationship between data splitting and data fission

suppose $n$ i.i.d. observations $X = (X_1,\ldots, X_n)$, where $X_i\sim p(\theta)$. Under data splitting, a fraction $a\in(0,1)$ of the points is used for selection and the rest for inference, so the Fisher information splits as

\[I_X(\theta) = anI_1(\theta) + (1-a)n I_1(\theta)\]

on the other hand

\[I_X(\theta) = I_{f(X)}(\theta) + \bbE[I_{g(X)\mid f(X)}(\theta)]\]

For Gaussian data, let $\{X_i\}_{i=1}^n$ be i.i.d. $N(\theta, \sigma^2)$. Then $I_1 = \frac{1}{\sigma^2}$, so $\frac{an}{\sigma^2}$ is the amount of information used for selection under a data splitting rule. If the data are fissioned using $f(X) = X+\tau Z$, then $I_{f(X)} = \frac{n}{\sigma^2(1+\tau^2)}$. Note that the relation $a = \frac{1}{1+\tau^2}$ results in the same split of information.
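As a one-line check of the stated correspondence (assuming, as above, $Z_i \sim N(0, \sigma^2)$ independent of $X_i$, so that $f(X_i) = X_i + \tau Z_i \sim N(\theta, \sigma^2(1+\tau^2))$):

\[a n I_1(\theta) = \frac{1}{1+\tau^2}\cdot\frac{n}{\sigma^2} = \frac{n}{\sigma^2(1+\tau^2)} = I_{f(X)}(\theta) \quad\text{when}\quad a = \frac{1}{1+\tau^2}.\]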

Application: selective CIs after interactive multiple testing

$y_i\sim N(\mu_i, \sigma^2)$ for $n$ data points with known $\sigma^2$ alongside $x_i\in \cX$

goal: choose a subset of hypotheses $\cR$ to reject from the set $\{H_{0,i}: \mu_i = 0\}$ while controlling the FDR, which is defined as the expected value of the FDP

\[\mathrm{FDP} = \frac{\vert\{i\in\cR:\mu_i=0\}\vert}{\max\{\vert\cR\vert, 1\}}\]

after selecting these hypotheses, one wishes to construct either

  • multiple CIs with $1-\alpha$ coverage of $\mu_i$ for each $i\in \cR$
  • a single CI with $1-\alpha$ coverage of $\bar\mu = \frac{1}{\vert\cR\vert}\sum_{i\in \cR}\mu_i$
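A minimal sketch of this application (assumptions: Gaussian fission $f(y)=y+\tau z$, $g(y)=y-z/\tau$ with $z\sim N(0,\sigma^2)$, and plain BH at level $\alpha$ on the fissioned half in place of the paper's interactive procedure; all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sigma, tau, alpha = 500, 1.0, 1.0, 0.1
mu = np.where(rng.random(n) < 0.1, 3.0, 0.0)          # sparse nonzero means
y = rng.normal(mu, sigma)

# fission: f(y) and g(y) are independent Gaussians (property P1)
z = rng.normal(0.0, sigma, n)
f_y, g_y = y + tau * z, y - z / tau
sd_f = sigma * np.sqrt(1 + tau**2)                    # sd of f(y_i)
sd_g = sigma * np.sqrt(1 + 1 / tau**2)                # sd of g(y_i)

# selection: BH on two-sided p-values computed from f(y) only
pvals = 2 * stats.norm.sf(np.abs(f_y) / sd_f)
order = np.argsort(pvals)
passed = np.flatnonzero(np.sort(pvals) <= alpha * np.arange(1, n + 1) / n)
R = order[: passed[-1] + 1] if passed.size else np.array([], dtype=int)

# inference: CIs for mu_i, i in R, from g(y), which is independent of the selection
zcrit = stats.norm.ppf(1 - alpha / 2)
ci = np.column_stack([g_y[R] - zcrit * sd_g, g_y[R] + zcrit * sd_g])
coverage = np.mean((ci[:, 0] <= mu[R]) & (mu[R] <= ci[:, 1])) if R.size else np.nan
print(f"rejections: {R.size}, coverage among selected: {coverage:.2f}")
```

Because the CIs use only $g(y)$, their validity does not depend on how $\cR$ was chosen from $f(y)$.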

Application: selective CIs in fixed-design linear regression

  • $y_i$: dependent variable
  • $x_i\in \IR^p$: non-random vector of $p$ features for $i = 1,\ldots, n$ samples
  • $X = (x_1,\ldots, x_n)^T$: model design matrix
  • $Y = (y_1,\ldots, y_n)^T$: response vector
\[Y = \mu + \epsilon\quad \text{with}\quad \epsilon = (\epsilon_1,\ldots, \epsilon_n)^T\sim N(0, \Sigma)\]

where

  • $\mu = \bbE[Y\mid X]\in \IR^n$ is a fixed unknown quantity
  • $\epsilon \in \IR^n$: random quantity with a known covariance matrix $\Sigma$

adding Gaussian noise $Z\sim N(0,\Sigma)$, letting $f(Y) = Y + \tau Z$ and $g(Y) = Y-\frac{1}{\tau}Z$

  1. use $f(Y)$ to select a model $M\subset [p]$ that defines a model design matrix $X_M$
  2. after selecting $M$, then use $g(Y)$ for inference by fitting a linear regression on $g(Y)$ against the selected covariates $X_M$

let $\hat\beta$ be defined in the usual way as

\[\hat\beta(M) = \argmin_{\tilde \beta} \Vert g(Y) - X_M\tilde\beta\Vert^2 = (X_M^TX_M)^{-1}X_M^Tg(Y)\]

the target parameter is the best linear approximator of the regression function using the selected model

\[\beta^\star(M) = \argmin_{\tilde\beta} \bbE\left[\Vert Y - X_M\tilde\beta\Vert^2 \right] = (X_M^TX_M)^{-1}X_M^T\mu\]

an assumption of this procedure is that the variance is known in order to do the initial split between $f(Y)$ and $g(Y)$ during the fission phase.

data fission will outperform data splitting in settings with small sample sizes and a handful of high-leverage points
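A minimal sketch of this pipeline (assumptions: $\Sigma = \sigma^2 I$ with known $\sigma$, and a lasso on $f(Y)$ standing in for the model-selection step; names are illustrative, not the paper's code):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma, tau, alpha = 100, 20, 1.0, 1.0, 0.1
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:3] = [2.0, -1.5, 1.0]
mu = X @ beta_true                                    # fixed, unknown mean vector
Y = mu + rng.normal(0, sigma, n)

# fission with Sigma = sigma^2 I: f(Y) and g(Y) are independent
Z = rng.normal(0, sigma, n)
f_Y, g_Y = Y + tau * Z, Y - Z / tau

# step 1: select a model M using f(Y) only
M = np.flatnonzero(Lasso(alpha=0.1).fit(X, f_Y).coef_)
X_M = X[:, M]

# step 2: infer with g(Y); given f(Y), g(Y) ~ N(mu, sigma^2 (1 + 1/tau^2) I)
beta_hat = np.linalg.solve(X_M.T @ X_M, X_M.T @ g_Y)
beta_star = np.linalg.solve(X_M.T @ X_M, X_M.T @ mu)  # projection target (oracle, for checking)
cov = sigma**2 * (1 + 1 / tau**2) * np.linalg.inv(X_M.T @ X_M)
zcrit = stats.norm.ppf(1 - alpha / 2)
half = zcrit * np.sqrt(np.diag(cov))
print("selected columns:", M)
print("CI covers beta*:", (beta_hat - half <= beta_star) & (beta_star <= beta_hat + half))
```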

Application to selective CIs in fixed-design generalized linear models and other quasi-MLE problems

it is important to construct CIs that are robust to model misspecification for two reasons

  1. it is unlikely in practical settings that one is able to select a set of covariates that corresponds to the actual conditional mean of the data
  2. the post-selection distribution $g(Y)\mid f(Y)$ may be ill-behaved and difficult to work with even if $Y$ is easy to model

QMLE (quasi-maximum likelihood estimation): use MLE to train a model, but construct guarantees in an assumption-lean setting

Problem setup and a recap of QMLE methods

suppose $n$ independent random observations $y_i\in \IR$ alongside fixed covariates $x_i\in\IR^p$

Assumption 1: Data fission is conducted such that $g(y_i)\ind g(y_k) \mid f(Y), X$ for all $i\neq k$

Application to post-selection inference for trend filtering and other nonparametric regression problems

\[y_i = f_0(x_i) + \epsilon_i\]
  • $f_0$ is the underlying function to be estimated and $\epsilon_i$ is random noise
  • denote $Y = (y_1,\ldots, y_n)^T, \epsilon = (\epsilon_1,\ldots,\epsilon_n)^T$
  1. decompose $Y$ into $f(Y) = Y+\tau Z$ and $g(Y) = Y - \frac{1}{\tau} Z$ where $Z\sim N(0, \Sigma)$
  2. use $f(Y)$ to choose a basis $a_1,\ldots, a_p$ for the series expansion at the $x_i$. Let $A$ denote the matrix of basis vectors for all $n$ data points
  3. use $g(Y)\mid f(Y)$ to construct pointwise or uniform CIs

the fitted coefficients are

\[\hat\beta(A) = \argmin_\beta \Vert g(Y) - A\beta\Vert^2 = (A^TA)^{-1}A^Tg(Y)\]

meanwhile, define the projected regression function

\[\beta^\star(A) = \argmin_\beta \bbE\left[ \Vert Y - A\beta\Vert^2 \right]\]
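Assuming Gaussian noise $\epsilon\sim N(0,\Sigma)$ as in the linear-regression setup (so that $g(Y)$ is independent of $f(Y)$ and the basis $A$ is fixed once we condition on $f(Y)$), the distribution used for the CIs is

\[\hat\beta(A)\mid f(Y) \sim N\Big(\beta^\star(A),\; (1+\tau^{-2})(A^TA)^{-1}A^T\Sigma A(A^TA)^{-1}\Big),\]

so pointwise $1-\alpha$ CIs for the projected regression function at $x_i$ take the usual Wald form based on the $i$-th row of $A$.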

Published in categories JSM-2024