# Data Fission


This note discusses the paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). arXiv. http://arxiv.org/abs/2112.11079, presented in the JASA invited session at JSM 2024.

- Apr. 2018: Tian and Taylor use external randomization for Gaussian conditional selective inference
- Nov. 2019: Data ~~blurring~~ fission research begins
- Feb. 2021: Rasines and Young apply Gaussian external randomization to non-Gaussian data with asymptotic error guarantees
- Dec. 2021: Data ~~blurring~~ fission preprint posted, initially unaware of Rasines and Young (2022)
- Jan. 2023: Data thinning [Neufeld et al. (2024)] creates an improved decomposition rule for distributions in the “convolution-closed” class (e.g., Gaussian, Poisson, Binomial)
- Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.

- Mar. 2023: Generalized data thinning [Dharamshi et al. (2024)] unifies a variety of splitting, thinning, and fission procedures via concept of sufficiency
- Jan. 2024: Graph fission [Leiner and Ramdas (2024)] explores fission procedures for inference and trend estimation on graph data sets

Generic selective inference problems

Main object of interest: $P_\theta(T(X)\mid S(X))$

- $\theta$ unknown parameter; $S(X)$ selection; $T(X)$ inference; can be randomized
- Examples:
- inference for coefficients after model selection
- Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics, 44(3), 907–927.
- Fithian, W., Sun, D., & Taylor, J. (2017). Optimal Inference After Model Selection (No. arXiv:1410.2597). arXiv. https://doi.org/10.48550/arXiv.1410.2597
- Tian, X., & Taylor, J. (2017). Asymptotics of Selective Inference. Scandinavian Journal of Statistics, 44(2), 480–499. https://doi.org/10.1111/sjos.12261

- hypothesis testing after clustering
- Gao, L. L., Bien, J., & Witten, D. (2024). Selective Inference for Hierarchical Clustering. Journal of the American Statistical Association, 119(545), 332–342. https://doi.org/10.1080/01621459.2022.2116331
- Yun, Y.-J., & Barber, R. F. (2023). Selective inference for clustering with unknown variance. Electronic Journal of Statistics, 17(2), 1923–1946. https://doi.org/10.1214/23-EJS2143
- González-Delgado, J., Cortés, J., & Neuvial, P. (2023). Post-clustering Inference under Dependency (No. arXiv:2310.11822). arXiv. https://doi.org/10.48550/arXiv.2310.11822

- data-driven ranking of hypotheses for multiple testing
- Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5). https://doi.org/10.1214/15-AOS1337
- Candes, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection. arXiv:1610.02351 [Math, Stat]. http://arxiv.org/abs/1610.02351
- Lei, L., & Fithian, W. (2018). AdaPT: An Interactive Procedure for Multiple Testing with Side Information. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4), 649–679. https://doi.org/10.1111/rssb.12274
- Li, A., & Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1), 45–74. https://doi.org/10.1111/rssb.12298
- Yurko, R., G’Sell, M., Roeder, K., & Devlin, B. (2020). A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proceedings of the National Academy of Sciences, 117(26), 15028–15035. https://doi.org/10.1073/pnas.1918862117
- Lei, L., Ramdas, A., & Fithian, W. (2021). A general interactive framework for false discovery rate control under structural constraints. Biometrika, 108(2), 253–267. https://doi.org/10.1093/biomet/asaa064
- Ignatiadis, N., & Huber, W. (2021). Covariate Powered Cross-Weighted Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4), 720–751. https://doi.org/10.1111/rssb.12411
- Chao, P., & Fithian, W. (2021). AdaPT-GMM: Powerful and robust covariate-assisted multiple testing (No. arXiv:2106.15812). arXiv. https://doi.org/10.48550/arXiv.2106.15812

- permutation/rank tests with learned test statistics
- Kuchibhotla, A. K. (2021). Exchangeability, Conformal Prediction, and Rank Tests (No. arXiv:2005.06095). arXiv. http://arxiv.org/abs/2005.06095
- Yang, C.-Y., Lei, L., Ho, N., & Fithian, W. (2021). BONuS: Multiple multivariate testing with a data-adaptive test statistic (No. arXiv:2106.15743). arXiv. https://doi.org/10.48550/arXiv.2106.15743
- Bates, S., Candès, E., Lei, L., Romano, Y., & Sesia, M. (2023). Testing for outliers with conformal p-values. The Annals of Statistics, 51(1), 149–178. https://doi.org/10.1214/22-AOS2244
- Marandon, A., Lei, L., Mary, D., & Roquain, E. (2023). Adaptive novelty detection with false discovery rate guarantee (No. arXiv:2208.06685). arXiv. https://doi.org/10.48550/arXiv.2208.06685

- inference for adaptive experiments
- Hirano, K., & Porter, J. R. (2023). Asymptotic Representations for Sequential Decisions, Adaptive Experiments, and Batched Bandits (No. arXiv:2302.03117). arXiv. https://doi.org/10.48550/arXiv.2302.03117
- Chen, J., & Andrews, I. (2023). Optimal Conditional Inference in Adaptive Experiments (No. arXiv:2309.12162). arXiv. https://doi.org/10.48550/arXiv.2309.12162

- narrowing down the null space for complex composite null
- testing overlap/inference on extremums of conditional mean
- D’Amour, A., Ding, P., Feller, A., Lei, L., & Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.2019.10.014
- Lei, L. (2023). Distribution-free inference on the extremum of conditional expectations via classification.

- calibrating black-box ML algorithms
- Angelopoulos, A. N., Bates, S., Candès, E. J., Jordan, M. I., & Lei, L. (2022). Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control (No. arXiv:2110.01052). arXiv. https://doi.org/10.48550/arXiv.2110.01052


When data splitting cannot answer the right question

- Case 1: $S(X)$ is given
- Hypothesis testing after clustering (same refs as above)
- Inference after LASSO-selected coefficients
- Panigrahi, S., & Taylor, J. (2023). Approximate Selective Inference via Maximum Likelihood. Journal of the American Statistical Association, 118(544), 2810–2820. https://doi.org/10.1080/01621459.2022.2081575
- Panigrahi, S., MacDonald, P. W., & Kessler, D. (2023). Approximate Post-Selective Inference for Regression with the Group LASSO.

- FDR estimation for a given selection procedure
- Luo, Y., Fithian, W., & Lei, L. (2023). Improving knockoffs with conditional calibration (No. arXiv:2208.09542). arXiv. https://doi.org/10.48550/arXiv.2208.09542

- Inference for adaptive experiments

- Case 2: making decision per data point
- multiple testing with summary statistics

- Case 3: heterogeneous data
- projection parameters in fixed-X regressions
- Rasines, D. G., & Young, G. A. (2023). Splitting strategies for post-selection inference. Biometrika, 110(3), 597–614. https://doi.org/10.1093/biomet/asac070

- survey sampling/finite population causal inference
- method evaluation for empirical Bayes procedures
- dependent data (e.g., time series, network data)


## Introduction

suppose we only observe a single data point $X\sim N(0, 1)$ and want to split it into parts such that each part contains some information about $X$:

- $X$ can be reconstructed from both parts taken together
- but neither part is sufficient by itself to reconstruct $X$
- and yet the joint distribution of these two parts is known

this example has a simple solution with external randomization.

generate an independent $Z\sim N(0, 1)$, and set $f(X) = X+Z$ and $g(X) = X-Z$

one can then reconstruct $X$ by averaging: $X = \frac{f(X)+g(X)}{2}$
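a minimal numpy sketch of this construction (illustrative only, not code from the paper): with $\mathrm{Var}(Z) = \mathrm{Var}(X)$ the two parts are uncorrelated, hence independent since $(f, g)$ is jointly Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# fission a single Gaussian draw X ~ N(0, 1) using external noise Z ~ N(0, 1)
X = rng.normal(0.0, 1.0)
Z = rng.normal(0.0, 1.0)

f = X + Z  # first part
g = X - Z  # second part

# neither part alone pins down X, but together they reconstruct it exactly
assert abs((f + g) / 2.0 - X) < 1e-12

# Cov(f, g) = Var(X) - Var(Z) = 0, checked empirically over many draws
samples = rng.normal(0.0, 1.0, size=(2, 200_000))
fs, gs = samples[0] + samples[1], samples[0] - samples[1]
assert abs(np.cov(fs, gs)[0, 1]) < 0.02
```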

seek to construct a family of pairs of functions $(f_\tau(X), g_\tau(X))_{\tau\in \cT}$, for some totally ordered set $\cT$, such that we can smoothly trade off the amount of information that each part contains about $X$

- when $\tau$ approaches $\tau^+:=\sup\{\tau: \tau\in \cT\}$, $f(X)$ will approach independence from $X$, while $g(X)$ will essentially equal $X$
- but when $\tau$ approaches $\tau^-:=\inf\{\tau:\tau \in\cT\}$, the opposite holds: $f(X)$ essentially equals $X$, while $g(X)$ approaches independence from $X$

data fission can achieve the same effect from just a single sample $X$, not an $n$-dimensional vector

Question: can the above ideas be generalized to other distributions?

can one employ external randomization to split a single data point into two nontrivial parts when the distribution $P$ is not Gaussian?

a positive answer when $P$ is conjugate to some other distribution $Q$, where the latter will be used (along with $X$) to determine the external randomization

### An application: data fission for post-selection inference

- Split $X$ into $f(X)$ and $g(X)$ such that $g(X)\mid f(X)$ is tractable to compute; the parameter $\tau$ controls the proportion of the information used for model selection
- use $f(X)$ to select a model and/or hypotheses to test
- use $g(X)\mid f(X)$ to test hypotheses and/or perform inference

Tian and Taylor (2018), Rasines and Young (2022): amount to letting $f(X) = X+\gamma Z$ and $g(X) = X$ with $Z\sim N(0, \sigma^2)$ for some fixed constant $\gamma > 0$

- when $X\sim N(\mu, \sigma^2)$ with known $\sigma^2$, $g(X)\mid f(X)$ has a tractable finite sample distribution
- in cases where $X$ is non-Gaussian, $g(X)\mid f(X)$ is only described asymptotically
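a quick simulation check of the Gaussian case (my sketch, not from the paper): with $f(X)=X+\gamma Z$, the conditional mean of $g(X)=X$ given $f(X)=t$ is $\mu + (t-\mu)/(1+\gamma^2)$, i.e., linear in $t$ with slope $1/(1+\gamma^2)$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, gamma = 2.0, 1.5, 0.7
n = 500_000

X = rng.normal(mu, sigma, n)     # data
Z = rng.normal(0.0, sigma, n)    # external Gaussian randomization
fX = X + gamma * Z               # part used for selection

# g(X) = X; given f(X) = t it is Gaussian with
#   mean mu + (t - mu) / (1 + gamma^2), variance sigma^2 gamma^2 / (1 + gamma^2)
slope_theory = 1.0 / (1.0 + gamma**2)

# empirical conditional-mean slope = Cov(X, f(X)) / Var(f(X))
slope_hat = np.cov(X, fX)[0, 1] / np.var(fX, ddof=1)
assert abs(slope_hat - slope_theory) < 0.01
```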

the methodologies can be seen as a compromise between data splitting and the approach of data carving

advantages relative to data splitting in at least two distinct ways

- certain settings that involve sparse or rarely occurring features may result in a handful of data points having a disproportionate amount of influence.
- in settings where the best selected model is defined relative to a set of fixed covariates, the theoretical justification for data splitting becomes less clear conceptually.

### Related work on data splitting and carving

the idea of adding and subtracting Gaussian noise has been employed for various goals in the literature

- Tian and Taylor (2018): selective inference by introducing a noise variable
- Xing et al. (2021): leave the response data unperturbed but create randomized versions of the covariates (termed “Gaussian mirrors”) by adding and subtracting Gaussian noise
- Li and Fithian (2021): recast the knockoff procedure of Barber and Candès (2015) as a selective inference problem for the linear Gaussian model that adds noise to the OLS estimate ($\hat\beta$) to create a “whitened” version of $\hat\beta$ to use for hypothesis selection
- Sarkar and Tang (2021): similar ways of using knockoffs to split $\hat\beta$ into independent pieces for hypothesis selection but uses a deterministic splitting procedure
- Ignatiadis et al. (2021): for empirical Bayes procedures and not selective inference

Outline:

- Section 2: general methodology for data fission
- the use of data fission for four different applications
- Section 3: selective CIs after interactive multiple testing
- Section 4: fixed-design linear regression
- Section 5: fixed-design generalized linear models
- Section 6: trend filtering

## Techniques to accomplish data fission

decompositions of $X$ into $f(X)$ and $g(X)$ such that both parts contain information about a parameter $\theta$ of interest, there exists some function $h$ with $X = h(f(X), g(X))$, and either of the following two properties holds

- P1: $f(X)$ and $g(X)$ are independent with known distributions (up to the unknown $\theta$)
- P2: $f(X)$ has a known marginal distribution and $g(X)$ has a known conditional distribution given $f(X)$ (up to knowledge of $\theta$)

### Achieving P2 using “conjugate prior reversal”

suppose $X$ follows a distribution that is a conjugate prior for the parameter of some likelihood; then construct a new random variable $Z$ following that likelihood (with the parameter being $X$).

Letting $f(X) = Z$ and $g(X) = X$, the conditional distribution of $g(X)\mid f(X)$ will have the same form as $X$ (with a different parameter depending on the value of $f(X)$)

- suppose $X \sim Exp(\theta)$, which is conjugate prior to the Poisson distribution
- draw $Z = (Z_1,\ldots, Z_B)$ where each element is i.i.d. $Z_i \sim Poi(X)$ and $B\in\{1,2,\ldots\}$ is a tuning parameter
- let $f(X) = Z$ and $g(X) = X$, then the conditional distribution of $g(X)\mid f(X)$ is $Gamma(1+\sum_{i=1}^B f_i(X), \theta+B)$. On the other hand, $\sum_{i=1}^B f_i(X)\sim Geo(\frac{\theta}{\theta+B})$
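a simulation sketch of this Exp/Poisson fission (the helper name is mine, not from the paper); note that numpy parameterizes the exponential by scale $1/\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, B = 2.0, 3

def fission_exponential(x, B, rng):
    """f(X) = (Z_1, ..., Z_B) with Z_i ~ Poisson(x) i.i.d.; g(X) = x."""
    return rng.poisson(x, size=B), x

# fission a single draw X ~ Exp(theta)
x = rng.exponential(1.0 / theta)
z, gx = fission_exponential(x, B, rng)

# conditional law: g(X) | f(X) ~ Gamma(shape = 1 + sum(z), rate = theta + B)
shape_post, rate_post = 1 + z.sum(), theta + B

# marginal check: sum of f(X) is geometric-type with mean B / theta
xs = rng.exponential(1.0 / theta, size=100_000)
zs = rng.poisson(xs[:, None], size=(100_000, B))
assert abs(zs.sum(axis=1).mean() - B / theta) < 0.05
```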

for exponential family distributions, we can construct $f(X)$ and $g(X)$ as follows

### Example decompositions

### Relationship between data splitting and data fission

suppose $n$ i.i.d. observations $X = (X_1,\ldots, X_n)$, where $X_i\sim p(\theta)$.

under data splitting with a fraction $a$ of the points used for selection,

\[I_X(\theta) = anI_1(\theta) + (1-a)n I_1(\theta)\]

on the other hand, under data fission,

\[I_X(\theta) = I_{f(X)}(\theta) + \bbE[I_{g(X)\mid f(X)}(\theta)]\]

For Gaussian datasets, let $\{X_i\}_{i=1}^n$ be i.i.d. $N(\theta, \sigma^2)$. Then $I_1 = \frac{1}{\sigma^2}$, so $\frac{an}{\sigma^2}$ is the amount of information used for selection under a data splitting rule. If the data is fissioned using $f(X) = X+\tau Z$, then $I_{f(X)} = \frac{n}{\sigma^2(1+\tau^2)}$. Note that the relation $a = \frac{1}{1+\tau^2}$ results in the same split of information.
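the matching information split can be checked numerically (a trivial sketch; the values of $n$, $\sigma^2$, $\tau$ are arbitrary):

```python
# data splitting vs data fission: same Fisher information devoted to selection
# for Gaussian data with known sigma^2, when a = 1 / (1 + tau^2)
n, sigma2, tau = 100, 2.0, 1.3

# data splitting: fraction a of the n points goes to selection
a = 1.0 / (1.0 + tau**2)
info_split_selection = a * n / sigma2

# data fission with f(X) = X + tau * Z, Z ~ N(0, sigma^2):
# Fisher information in f(X) is n / (sigma^2 * (1 + tau^2))
info_fission_selection = n / (sigma2 * (1.0 + tau**2))

assert abs(info_split_selection - info_fission_selection) < 1e-12
```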

## Application: selective CIs after interactive multiple testing

$y_i\sim N(\mu_i, \sigma^2)$ for $n$ data points with known $\sigma^2$ alongside $x_i\in \cX$

goal: choose a subset of hypotheses $\cR$ to reject from the set $\{H_{0,i}: \mu_i = 0\}$ while controlling the FDR, which is defined as the expected value of the FDP

\[\text{FDP} = \frac{\vert \{i\in\cR:\mu_i=0\}\vert}{\max\{\vert\cR\vert, 1\}}\]

after selecting these hypotheses, one wishes to construct either

- multiple CIs with $1-\alpha$ coverage of $\mu_i$ for each $i\in \cR$
- a single CI with $1-\alpha$ coverage of $\bar\mu = \frac{1}{\vert\cR\vert}\sum_{i\in \cR}\mu_i$
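a minimal simulation sketch of this application (mine, under stated assumptions): fission each $y_i$ with the P1 decomposition $f = y + \tau z$, $g = y - z/\tau$, $z_i\sim N(0,\sigma^2)$, so that $g$ is independent of $f$; the hard threshold on $f$ is a hypothetical stand-in for an actual FDR procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, tau = 200, 1.0, 1.0
mu = np.where(np.arange(n) < 20, 4.0, 0.0)   # 20 strong signals among nulls

y = rng.normal(mu, sigma)
z = rng.normal(0.0, sigma, n)   # external randomization with Var = sigma^2
f = y + tau * z                 # selection part, f_i ~ N(mu_i, sigma^2 (1 + tau^2))
g = y - z / tau                 # inference part, independent of f

# y is recoverable from the two parts: y = (f + tau^2 g) / (1 + tau^2)
assert np.allclose((f + tau**2 * g) / (1 + tau**2), y)

# select hypotheses using f only (a plain threshold stands in for an FDR rule)
sd_f = sigma * np.sqrt(1 + tau**2)
selected = np.abs(f) > 2.5 * sd_f

# since g is independent of f, ordinary CIs from g stay valid after selection:
# g_i ~ N(mu_i, sigma^2 (1 + 1 / tau^2))
sd_g = sigma * np.sqrt(1 + 1 / tau**2)
lo, hi = g - 1.96 * sd_g, g + 1.96 * sd_g

covered = (lo[selected] <= mu[selected]) & (mu[selected] <= hi[selected])
assert selected.sum() >= 1 and covered.mean() > 0.5
```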

## Application: selective CIs in fixed-design linear regression

- $y_i$: dependent variable
- $x_i\in \IR^p$: non-random vector of $p$ features for $i = 1,\ldots, n$ samples
- $X = (x_1,\ldots, x_n)^T$ model design matrix
- $Y = (y_1,\ldots, y_n)^T$

assume the model $Y = \mu + \epsilon$, where

- $\mu = \bbE[Y\mid X]\in \IR^n$ is a fixed unknown quantity
- $\epsilon \in \IR^n$: random quantity with a known covariance matrix $\Sigma$

adding Gaussian noise $Z\sim N(0,\Sigma)$, letting $f(Y) = Y + \tau Z$ and $g(Y) = Y-\frac{1}{\tau}Z$

- use $f(Y)$ to select a model $M\subset [p]$ that defines a model design matrix $X_M$
- after selecting $M$, then use $g(Y)$ for inference by fitting a linear regression on $g(Y)$ against the selected covariates $X_M$

let $\hat\beta$ be defined in the usual way as

\[\hat\beta(M) = \argmin_{\tilde \beta} \Vert g(Y) - X_M\tilde\beta\Vert^2 = (X_M^TX_M)^{-1}X_M^Tg(Y)\]

the target parameter is the best linear approximator of the regression function using the selected model

\[\beta^\star(M) = \argmin_{\tilde\beta} \bbE\left[\Vert Y - X_M\tilde\beta\Vert^2 \right] = (X_M^TX_M)^{-1}X_M^T\mu\]

**an assumption of this procedure is that the variance is known** in order to do the initial split between $f(Y)$ and $g(Y)$ during the fission phase.
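the select-with-$f(Y)$, infer-with-$g(Y)$ pipeline can be sketched in numpy (my illustration, assuming $\Sigma = \sigma^2 I$; the marginal-correlation screening step is a hypothetical stand-in for any selection rule):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, tau, sigma = 120, 10, 1.0, 1.0

X = rng.normal(size=(n, p))        # fixed design
beta_true = np.zeros(p)
beta_true[:3] = 2.0
mu = X @ beta_true                 # mu = E[Y | X]
Y = mu + rng.normal(0.0, sigma, n)

# fission with Sigma = sigma^2 I
Z = rng.normal(0.0, sigma, n)
fY = Y + tau * Z
gY = Y - Z / tau

# step 1: select a model M of size 3 using f(Y) only
# (marginal-correlation screening stands in for any selection rule)
M = np.argsort(np.abs(X.T @ fY))[-3:]

# step 2: inference with g(Y) on the selected design X_M
XM = X[:, M]
beta_hat = np.linalg.solve(XM.T @ XM, XM.T @ gY)

# target: projection parameter beta*(M) = (X_M^T X_M)^{-1} X_M^T mu
beta_star = np.linalg.solve(XM.T @ XM, XM.T @ mu)
assert np.all(np.abs(beta_hat - beta_star) < 1.0)
```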

**data fission will outperform data splitting in settings with small sample size with a handful of points with high leverage**

## Application to selective CIs in fixed-design generalized linear models and other quasi-MLE problems

it is important to construct CIs that are robust to model misspecification for two reasons

- it is unlikely in practical settings that one is able to select a set of covariates that corresponds to the actual conditional mean of the data
- the post-selective distribution $g(Y)\mid f(Y)$ may be ill-behaved and difficult to work with even if $Y$ is easy to model

QMLE (quasi-maximum likelihood estimation): using MLE to train a model while constructing guarantees in an assumption-lean setting

### Problem setup and a recap of QMLE methods

suppose $n$ independent random observations $y_i\in \IR$ alongside fixed covariates $x_i\in\IR^p$

Assumption 1: Data fission is conducted such that $g(y_i)\ind g(y_k) \mid f(Y), X$ for all $i\neq k$

## Application to post-selection inference for trend filtering and other nonparametric regression problems

\[y_i = f_0(x_i) + \epsilon_i\]

- $f_0$ is the underlying function to be estimated and $\epsilon_i$ is random noise
- denote $Y = (y_1,\ldots, y_n)^T, \epsilon = (\epsilon_1,\ldots,\epsilon_n)^T$

- decompose $Y$ into $f(Y) = Y+\tau Z$ and $g(Y) = Y - \frac{1}{\tau} Z$ where $Z\sim N(0, \Sigma)$
- use $f(Y)$ to choose a basis $a_1,\ldots, a_p$ for the series expansion of $x_i$. Let $A$ denote the matrix of basis vectors for all $n$ data points
- use $g(Y)\mid f(Y)$ to construct pointwise or uniform CIs

the fitted line

\[\hat\beta(A) = \argmin_\beta \Vert g(Y) - A\beta\Vert^2 = (A^TA)^{-1}A^Tg(Y)\]

meanwhile, define the projected regression function

\[\beta^\star(A) = \argmin_\beta \bbE\left[ \Vert Y - A\beta\Vert^2 \right]\]