Data Fission
This note is for the discussion of the paper Leiner, J., Duan, B., Wasserman, L., & Ramdas, A. (2023). Data fission: Splitting a single data point (arXiv:2112.11079). http://arxiv.org/abs/2112.11079, presented in the JASA invited session at JSM 2024.
- Apr. 2018: Tian and Taylor use external randomization for Gaussian conditional selective inference
- Nov. 2019: Data fission research begins (initially under the name "data blurring")
- Feb. 2021: Rasines and Young apply Gaussian external randomization to non-Gaussian data with asymptotic error guarantees
- Dec. 2021: Data fission preprint posted (as "data blurring"); initially unaware of Rasines and Young (2022)
- Jan. 2023: Data thinning [Neufeld et al. (2024)] creates an improved decomposition rule for distributions in the "convolution-closed" class (e.g., Gaussian, Poisson, Binomial)
- Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.
- Mar. 2023: Generalized data thinning [Dharamshi et al. (2024)] unifies a variety of splitting, thinning, and fission procedures via the concept of sufficiency
- Jan. 2024: Graph fission [Leiner and Ramdas (2024)] explores fission procedures for inference and trend estimation on graph data sets
Generic selective inference problems
Main object of interest: $P_\theta(T(X)\mid S(X))$
- $\theta$: unknown parameter; $S(X)$: selection; $T(X)$: inference target; both can be randomized
- Examples:
- inference for coefficients after model selection
- Lee, J. D., Sun, D. L., Sun, Y., & Taylor, J. E. (2016). Exact Post-Selection Inference, with Application to the Lasso. The Annals of Statistics, 44(3), 907–927.
- Fithian, W., Sun, D., & Taylor, J. (2017). Optimal Inference After Model Selection (No. arXiv:1410.2597). arXiv. https://doi.org/10.48550/arXiv.1410.2597
- Tian, X., & Taylor, J. (2017). Asymptotics of Selective Inference. Scandinavian Journal of Statistics, 44(2), 480–499. https://doi.org/10.1111/sjos.12261
- hypothesis testing after clustering
- Gao, L. L., Bien, J., & Witten, D. (2024). Selective Inference for Hierarchical Clustering. Journal of the American Statistical Association, 119(545), 332–342. https://doi.org/10.1080/01621459.2022.2116331
- Yun, Y.-J., & Barber, R. F. (2023). Selective inference for clustering with unknown variance. Electronic Journal of Statistics, 17(2), 1923–1946. https://doi.org/10.1214/23-EJS2143
- González-Delgado, J., Cortés, J., & Neuvial, P. (2023). Post-clustering Inference under Dependency (No. arXiv:2310.11822). arXiv. https://doi.org/10.48550/arXiv.2310.11822
- data-driven ranking of hypotheses for multiple testing
- Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5). https://doi.org/10.1214/15-AOS1337
- Candès, E., Fan, Y., Janson, L., & Lv, J. (2017). Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection (No. arXiv:1610.02351). arXiv. http://arxiv.org/abs/1610.02351
- Lei, L., & Fithian, W. (2018). AdaPT: An Interactive Procedure for Multiple Testing with Side Information. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4), 649–679. https://doi.org/10.1111/rssb.12274
- Li, A., & Barber, R. F. (2019). Multiple testing with the structure-adaptive Benjamini–Hochberg algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(1), 45–74. https://doi.org/10.1111/rssb.12298
- Yurko, R., G’Sell, M., Roeder, K., & Devlin, B. (2020). A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk. Proceedings of the National Academy of Sciences, 117(26), 15028–15035. https://doi.org/10.1073/pnas.1918862117
- Lei, L., Ramdas, A., & Fithian, W. (2021). A general interactive framework for false discovery rate control under structural constraints. Biometrika, 108(2), 253–267. https://doi.org/10.1093/biomet/asaa064
- Ignatiadis, N., & Huber, W. (2021). Covariate Powered Cross-Weighted Multiple Testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(4), 720–751. https://doi.org/10.1111/rssb.12411
- Chao, P., & Fithian, W. (2021). AdaPT-GMM: Powerful and robust covariate-assisted multiple testing (No. arXiv:2106.15812). arXiv. https://doi.org/10.48550/arXiv.2106.15812
- permutation/rank tests with learned test statistics
- Kuchibhotla, A. K. (2021). Exchangeability, Conformal Prediction, and Rank Tests (No. arXiv:2005.06095). arXiv. http://arxiv.org/abs/2005.06095
- Yang, C.-Y., Lei, L., Ho, N., & Fithian, W. (2021). BONuS: Multiple multivariate testing with a data-adaptive test statistic (No. arXiv:2106.15743). arXiv. https://doi.org/10.48550/arXiv.2106.15743
- Bates, S., Candès, E., Lei, L., Romano, Y., & Sesia, M. (2023). Testing for outliers with conformal p-values. The Annals of Statistics, 51(1), 149–178. https://doi.org/10.1214/22-AOS2244
- Marandon, A., Lei, L., Mary, D., & Roquain, E. (2023). Adaptive novelty detection with false discovery rate guarantee (No. arXiv:2208.06685). arXiv. https://doi.org/10.48550/arXiv.2208.06685
- inference for adaptive experiments
- Hirano, K., & Porter, J. R. (2023). Asymptotic Representations for Sequential Decisions, Adaptive Experiments, and Batched Bandits (No. arXiv:2302.03117). arXiv. https://doi.org/10.48550/arXiv.2302.03117
- Chen, J., & Andrews, I. (2023). Optimal Conditional Inference in Adaptive Experiments (No. arXiv:2309.12162). arXiv. https://doi.org/10.48550/arXiv.2309.12162
- narrowing down the null space for complex composite null
- testing overlap / inference on extrema of the conditional mean
- D’Amour, A., Ding, P., Feller, A., Lei, L., & Sekhon, J. (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221(2), 644–654. https://doi.org/10.1016/j.jeconom.2019.10.014
- Lei, L. (2023). Distribution-free inference on the extremum of conditional expectations via classification.
- calibrating black-box ML algorithms
- Angelopoulos, A. N., Bates, S., Candès, E. J., Jordan, M. I., & Lei, L. (2022). Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control (No. arXiv:2110.01052). arXiv. https://doi.org/10.48550/arXiv.2110.01052
When data splitting cannot answer the right question
- Case 1: $S(X)$ is given
- Hypothesis testing after clustering (same refs as above)
- Inference on LASSO-selected coefficients
- Panigrahi, S., & Taylor, J. (2023). Approximate Selective Inference via Maximum Likelihood. Journal of the American Statistical Association, 118(544), 2810–2820. https://doi.org/10.1080/01621459.2022.2081575
- Panigrahi, S., MacDonald, P. W., & Kessler, D. (2023). Approximate Post-Selective Inference for Regression with the Group LASSO.
- FDR estimation for a given selection procedure
- Luo, Y., Fithian, W., & Lei, L. (2023). Improving knockoffs with conditional calibration (No. arXiv:2208.09542). arXiv. https://doi.org/10.48550/arXiv.2208.09542
- Inference for adaptive experiments
- Case 2: making a decision per data point
- multiple testing with summary statistics
- Case 3: heterogeneous data
- projection parameters in fixed-X regressions
- Rasines, D. G., & Young, G. A. (2023). Splitting strategies for post-selection inference. Biometrika, 110(3), 597–614. https://doi.org/10.1093/biomet/asac070
- survey sampling/finite population causal inference
- method evaluation for empirical Bayes procedures
- dependent data (e.g., time series, network data)
Introduction
suppose we only observe a single data point $X\sim N(0, 1)$; the goal is to split it into two parts, each containing some information about $X$, such that
- $X$ can be reconstructed from both parts taken together
- but neither part is sufficient by itself to reconstruct $X$
- and yet the joint distribution of these two parts is known
this example has a simple solution with external randomization.
generate an independent $Z\sim N(0, 1)$, and set $f(X) = X+Z$ and $g(X) = X-Z$
one can then reconstruct $X$ by averaging the two parts: $X = \frac{f(X)+g(X)}{2}$
seek to construct a family of pairs of functions $(f_\tau(X), g_\tau(X))_{\tau\in \cT}$, for some totally ordered set $\cT$, such that we can smoothly trade off the amount of information that each part contains about $X$
- when $\tau$ approaches $\tau^+:=\sup\{\tau: \tau\in \cT\}$, $f(X)$ will approach independence from $X$, while $g(X)$ will essentially equal $X$
- but when $\tau$ approaches $\tau^-:=\inf\{\tau:\tau \in\cT\}$, the opposite holds
data fission can achieve the same effect from just a single sample $X$, not an $n$-dimensional vector
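a minimal numerical sketch of this Gaussian family (my own illustration, not code from the paper), with $Z\sim N(0,\sigma^2)$ drawn independently of $X$:

```python
import numpy as np

# sketch of the tau-indexed Gaussian fission family (constants are illustrative)
rng = np.random.default_rng(0)
mu, sigma, tau, n_rep = 1.5, 2.0, 0.7, 200_000

X = rng.normal(mu, sigma, n_rep)           # the "data"
Z = rng.normal(0.0, sigma, n_rep)          # external randomization, independent of X
f = X + tau * Z                            # part used for selection
g = X - Z / tau                            # part used for inference

print(np.corrcoef(f, g)[0, 1])             # ~ 0: the parts are independent Gaussians
X_rec = (f + tau**2 * g) / (1 + tau**2)    # exact reconstruction of X from both parts
print(np.max(np.abs(X_rec - X)))           # ~ 0 up to floating point error

# information trade-off: as tau grows, f carries less about X and g more
print(f.var(), sigma**2 * (1 + tau**2))    # Var(f) = sigma^2 (1 + tau^2)
print(g.var(), sigma**2 * (1 + 1/tau**2))  # Var(g) = sigma^2 (1 + 1/tau^2)
```

the factor $1/\tau$ in $g$ is what makes the two parts exactly uncorrelated (hence independent, being jointly Gaussian), and $X = \frac{f_\tau(X) + \tau^2 g_\tau(X)}{1+\tau^2}$ recovers the original point.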
Question: can the above ideas be generalized to other distributions?
can one employ external randomization to split a single data point into two nontrivial parts when the distribution $P$ is not Gaussian?
a positive answer when $P$ is conjugate to some other distribution $Q$, where the latter will be used (along with $X$) to determine the external randomization
An application: data fission for post-selection inference
- Split $X$ into $f(X)$ and $g(X)$ such that $g(X)\mid f(X)$ is tractable to compute; the parameter $\tau$ controls the proportion of the information used for model selection
- use $f(X)$ to select a model and/or hypotheses to test
- use $g(X)\mid f(X)$ to test hypotheses and/or perform inference
Tian and Taylor (2018) and Rasines and Young (2022): these approaches amount to letting $f(X) = X+\gamma Z$ and $g(X) = X$ with $Z\sim N(0, \sigma^2)$, for some fixed constant $\gamma > 0$
- when $X\sim N(\mu, \sigma^2)$ with known $\sigma^2$, $g(X)\mid f(X)$ has a tractable finite-sample distribution (worked out below)
- in cases where $X$ is non-Gaussian, $g(X)\mid f(X)$ is only described asymptotically
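for concreteness, the Gaussian conditional referenced above is just bivariate-normal conditioning (my own restatement, with $Z\sim N(0,\sigma^2)$ drawn independently of $X$):
\[X \mid f(X) \;\sim\; N\!\left(\mu + \frac{f(X)-\mu}{1+\gamma^2},\; \frac{\gamma^2}{1+\gamma^2}\sigma^2\right),\]
which involves the unknown $\mu$ only through its mean, so it can be inverted into tests and CIs for $\mu$ after selection on $f(X)$; for non-Gaussian $X$ no such exact conditional is available, hence the asymptotic guarantees.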
the methodologies can be seen as a compromise between data splitting and the approach of data carving
data fission has advantages over data splitting in at least two distinct ways
- certain settings that involve sparse or rarely occurring features may result in a handful of data points having a disproportionate amount of influence.
- in settings where the best selected model is defined relative to a set of fixed covariates, the theoretical justification for data splitting becomes less clear conceptually.
Related work on data splitting and carving
the idea of adding and subtracting Gaussian noise has been employed for various goals in the literature
- Tian and Taylor (2018): selective inference by introducing a noise variable
- Xing et al. (2021): leave the response data unperturbed but create randomized versions of the covariates (termed "Gaussian mirrors") by adding and subtracting Gaussian noise
- Li and Fithian (2021): recast the knockoff procedure of Barber and Candès (2015) as a selective inference problem for the linear Gaussian model that adds noise to the OLS estimate ($\hat\beta$) to create a "whitened" version of $\hat\beta$ to use for hypothesis selection
- Sarkar and Tang (2021): use knockoffs in a similar way to split $\hat\beta$ into independent pieces for hypothesis selection, but with a deterministic splitting procedure
- Ignatiadis et al. (2021): for empirical Bayes procedures and not selective inference
Outline:
- Section 2: general methodology for data fission
- the use of data fission for four different applications
- Section 3: selective CIs after interactive multiple testing
- Section 4: fixed-design linear regression
- Section 5: fixed-design generalized linear models
- Section 6: trend filtering
Techniques to accomplish data fission
decompositions of $X$ into $f(X)$ and $g(X)$ such that both parts contain information about a parameter $\theta$ of interest, there exists some function $h$ such that $X = h(f(X), g(X))$, and either of the following two properties holds
- P1: $f(X)$ and $g(X)$ are independent with known distributions (up to the unknown $\theta$)
- P2: $f(X)$ has a known marginal distribution and $g(X)$ has a known conditional distribution given $f(X)$ (up to knowledge of $\theta$)
Achieving P2 using “conjugate prior reversal”
suppose $X$ follows a distribution that is the conjugate prior for the parameter of some likelihood; then construct a new random variable $Z$ following that likelihood (with the parameter set to $X$).
Letting $f(X) = Z$ and $g(X) = X$, the conditional distribution of $g(X)\mid f(X)$ will have the same form as $X$ (with a different parameter depending on the value of $f(X)$)
- suppose $X \sim Exp(\theta)$, which is conjugate prior to the Poisson distribution
- draw $Z = (Z_1,\ldots, Z_B)$ where the elements are i.i.d. $Z_i \sim Poi(X)$ and $B\in\{1,2,\ldots\}$ is a tuning parameter
- let $f(X) = Z$ and $g(X) = X$; then the conditional distribution of $g(X)\mid f(X)$ is $Gamma(1+\sum_{i=1}^B f_i(X), \theta+B)$. On the other hand, marginally $\sum_{i=1}^B f_i(X)\sim Geo(\frac{\theta}{\theta+B})$ (see the simulation sketch below)
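a quick Monte Carlo check of this exponential example (a sketch; the constants are mine):

```python
import numpy as np

# sketch of the conjugate-prior-reversal example: X ~ Exp(theta) (rate theta),
# Z_i | X ~ Poisson(X) i.i.d. for i = 1, ..., B  (constants are illustrative)
rng = np.random.default_rng(1)
theta, B, n_rep = 2.0, 5, 500_000

X = rng.exponential(scale=1.0 / theta, size=n_rep)      # Exp with rate theta
Z = rng.poisson(lam=np.repeat(X[:, None], B, axis=1))   # B Poisson draws per X
S = Z.sum(axis=1)                                        # summary of f(X) = Z

# marginally S ~ Geometric(theta/(theta+B)) on {0, 1, 2, ...}, so E[S] = B/theta
print(S.mean(), B / theta)

# conditionally on S = s, g(X) = X follows Gamma(1 + s, rate = theta + B)
s = 3
print(X[S == s].mean(), (1 + s) / (theta + B))           # posterior means agree
```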
for exponential family distributions, we can construct $f(X)$ and $g(X)$ as follows
Example decompositions
Relationship between data splitting and data fission
suppose $n$ i.i.d. observations $X = (X_1,\ldots, X_n)$, where $X_i\sim p(\theta)$; data splitting uses a fraction $a\in(0,1)$ of the observations for selection and the rest for inference, so the Fisher information decomposes as
\[I_X(\theta) = a n I_1(\theta) + (1-a) n I_1(\theta)\]
on the other hand, under data fission,
\[I_X(\theta) = I_{f(X)}(\theta) + \bbE\left[I_{g(X)\mid f(X)}(\theta)\right]\]
For Gaussian data, let $\{X_i\}_{i=1}^n$ be i.i.d. $N(\theta, \sigma^2)$. Then $I_1(\theta) = \frac{1}{\sigma^2}$, so $\frac{an}{\sigma^2}$ is the amount of information used for selection under a data splitting rule. If the data are fissioned using $f(X) = X+\tau Z$, then $I_{f(X)}(\theta) = \frac{n}{\sigma^2(1+\tau^2)}$. Note that the relation $a = \frac{1}{1+\tau^2}$ results in the same split of information.
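the last identity is a one-line check (assuming the external noise is $Z_i\sim N(0,\sigma^2)$, drawn independently of the data): each fissioned observation satisfies
\[f(X_i) = X_i + \tau Z_i \sim N(\theta, \sigma^2(1+\tau^2)), \qquad\text{so}\qquad I_{f(X)}(\theta) = \frac{n}{\sigma^2(1+\tau^2)},\]
and equating this with the $a n I_1(\theta) = \frac{an}{\sigma^2}$ used by data splitting gives $a = \frac{1}{1+\tau^2}$.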
Application: selective CIs after interactive multiple testing
$y_i\sim N(\mu_i, \sigma^2)$ for $n$ data points with known $\sigma^2$ alongside $x_i\in \cX$
goal: choose a subset of hypotheses $\cR$ to reject from the set $\{H_{0,i}: \mu_i = 0\}$ while controlling the FDR, which is defined as the expected value of the FDP
\[\mathrm{FDP} = \frac{\vert\{i\in\cR:\mu_i=0\}\vert}{\max\{\vert\cR\vert, 1\}}\]
after selecting these hypotheses, one wishes to construct either of the following (a schematic sketch follows the list):
- multiple CIs with $1-\alpha$ coverage of $\mu_i$ for each $i\in \cR$
- a single CI with $1-\alpha$ coverage of $\bar\mu = \frac{1}{\vert\cR\vert}\sum_{i\in \cR}\mu_i$
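a schematic of this workflow (a sketch only: the paper couples fission with an interactive testing procedure, whereas this toy example just runs BH on the $f$-part and builds ordinary CIs from the independent $g$-part; all constants are illustrative):

```python
import numpy as np
from scipy.stats import norm

# sketch: fission -> BH selection on f -> ordinary CIs from g (illustrative)
rng = np.random.default_rng(2)
n, sigma, tau, alpha, q = 1000, 1.0, 1.0, 0.1, 0.1
mu = np.where(rng.random(n) < 0.1, 3.0, 0.0)     # a sparse set of true signals
y = rng.normal(mu, sigma)

Z = rng.normal(0.0, sigma, n)                    # external randomization
f = y + tau * Z                                  # selection part
g = y - Z / tau                                  # inference part, independent of f

# selection: Benjamini-Hochberg on two-sided p-values computed from f alone
p = 2 * norm.sf(np.abs(f) / (sigma * np.sqrt(1 + tau**2)))
order = np.argsort(p)
below = np.nonzero(p[order] <= q * np.arange(1, n + 1) / n)[0]
R = order[: below.max() + 1] if below.size else np.array([], dtype=int)

# inference: g_i ~ N(mu_i, sigma^2 (1 + 1/tau^2)) independently of the selection,
# so per-coordinate CIs keep 1 - alpha coverage for every selected mu_i
half = norm.ppf(1 - alpha / 2) * sigma * np.sqrt(1 + 1 / tau**2)
lo, hi = g[R] - half, g[R] + half
print(len(R), "rejections; empirical coverage:",
      np.mean((lo <= mu[R]) & (mu[R] <= hi)) if len(R) else "n/a")
```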
Application: selective CIs in fixed-design linear regression
- $y_i$: dependent variable
- $x_i\in \IR^p$: non-random vector of $p$ features for $i = 1,\ldots, n$ samples
- $X = (x_1,\ldots, x_n)^T$ model design matrix
- $Y = (y_1,\ldots, y_n)^T$
the model is $Y = \mu + \epsilon$, where
- $\mu = \bbE[Y\mid X]\in \IR^n$ is a fixed unknown quantity
- $\epsilon \in \IR^n$: random quantity with a known covariance matrix $\Sigma$
fission proceeds by adding Gaussian noise $Z\sim N(0,\Sigma)$, drawn independently of $Y$, and letting $f(Y) = Y + \tau Z$ and $g(Y) = Y-\frac{1}{\tau}Z$
- use $f(Y)$ to select a model $M\subset [p]$ that defines a model design matrix $X_M$
- after selecting $M$, then use $g(Y)$ for inference by fitting a linear regression on $g(Y)$ against the selected covariates $X_M$
let $\hat\beta$ be defined in the usual way as
\[\hat\beta(M) = \argmin_{\tilde \beta} \Vert g(Y) - X_M\tilde\beta\Vert^2 = (X_M^TX_M)^{-1}X_M^Tg(Y)\]
the target parameter is the best linear approximator of the regression function using the selected model
\[\beta^\star(M) = \argmin_{\tilde\beta} \bbE\left[\Vert Y - X_M\tilde\beta\Vert^2 \right] = (X_M^TX_M)^{-1}X_M^T\mu\]
an assumption of this procedure is that the variance is known in order to do the initial split between $f(Y)$ and $g(Y)$ during the fission phase.
data fission outperforms data splitting in settings with a small sample size and a handful of high-leverage points
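a sketch of the regression recipe above (illustrative only: marginal screening stands in for a real model-selection rule, $\Sigma = \sigma^2 I$ is assumed, and all constants are mine):

```python
import numpy as np
from scipy.stats import norm

# sketch: fission-based post-selection inference in fixed-design regression
# (marginal screening stands in for an actual selection rule; Sigma = sigma^2 I)
rng = np.random.default_rng(3)
n, p, sigma, tau, alpha = 200, 20, 1.0, 1.0, 0.1
X = rng.normal(size=(n, p))                       # fixed design
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
mu = X @ beta                                     # E[Y | X], treated as fixed
Y = mu + rng.normal(0.0, sigma, n)

Z = rng.normal(0.0, sigma, n)
fY = Y + tau * Z                                  # selection part
gY = Y - Z / tau                                  # inference part, independent of fY

M = np.argsort(-np.abs(X.T @ fY))[:5]             # select 5 covariates using f(Y) only
XM = X[:, M]

# OLS of g(Y) on the selected columns; the target is the projection parameter
G = np.linalg.inv(XM.T @ XM)
beta_hat = G @ XM.T @ gY                          # hat beta(M)
beta_star = G @ XM.T @ mu                         # beta*(M) = (XM'XM)^{-1} XM' mu

# conditionally on f(Y) (hence on M), g(Y) ~ N(mu, sigma^2 (1 + 1/tau^2) I)
se = np.sqrt(np.diag(G) * sigma**2 * (1 + 1 / tau**2))
half = norm.ppf(1 - alpha / 2) * se
print(M)
print((beta_hat - half <= beta_star) & (beta_star <= beta_hat + half))  # coverage
```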
Application to selective CIs in fixed-design generalized linear models and other quasi-MLE problems
it is important to construct CIs that are robust to model misspecification for two reasons
- it is unlikely in practical settings that one is able to select a set of covariates that corresponds to the actual conditional mean of the data
- the post-selective distribution $g(Y)\mid f(Y)$ may be ill-behaved and difficult to work with even if $Y$ is easy to model
QMLE (quasi-maximum likelihood estimation): using MLE to train a model but constructing guarantees in an assumption-lean setting
Problem setup and a recap of QMLE methods
suppose $n$ independent random observations $y_i\in \IR$ alongside fixed covariates $x_i\in\IR^p$
Assumption 1: Data fission is conducted such that $g(y_i)\ind g(y_k) \mid f(Y), X$ for all $i\neq k$
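for a concrete GLM instance, here is a sketch using a Poisson decomposition via binomial thinning (the screening rule and the plain Newton fit below are my own stand-ins, and this sketch omits the sandwich-variance CIs used for misspecification-robust inference):

```python
import numpy as np

# sketch: fission for a Poisson GLM via binomial thinning.  If y_i ~ Poisson(mu_i)
# and Z_i | y_i ~ Binomial(y_i, p), then f(y_i) = Z_i ~ Poisson(p mu_i) and
# g(y_i) = y_i - Z_i ~ Poisson((1 - p) mu_i), independently of f(y_i).
rng = np.random.default_rng(4)
n, d, p_thin = 500, 10, 0.5
X = rng.normal(scale=0.3, size=(n, d))            # fixed covariates
beta = np.zeros(d); beta[:2] = [1.0, -0.8]
mu = np.exp(0.5 + X @ beta)
y = rng.poisson(mu)

f = rng.binomial(y, p_thin)                       # selection part
g = y - f                                         # inference part

def poisson_irls(X1, y1, iters=50):
    """plain Newton iterations for a Poisson log-link regression"""
    b = np.zeros(X1.shape[1])
    for _ in range(iters):
        m = np.exp(X1 @ b)                        # current fitted means
        b = b + np.linalg.solve(X1.T @ (m[:, None] * X1), X1.T @ (y1 - m))
    return b

# select covariates using f only (a crude score-screening stand-in)
M = np.argsort(-np.abs(X.T @ (f - f.mean())))[:3]
XM = np.column_stack([np.ones(n), X[:, M]])

# fit on g: with a log link the intercept absorbs log(1 - p_thin), so the slope
# coefficients target the same (pseudo-)parameters as a fit on the full y would
b_hat = poisson_irls(XM, g)
print(M, b_hat)
```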
Application to post-selection inference for trend filtering and other nonparametric regression problems
\[y_i = f_0(x_i) + \epsilon_i\]
- $f_0$ is the underlying function to be estimated and $\epsilon_i$ is random noise
- denote $Y = (y_1,\ldots, y_n)^T, \epsilon = (\epsilon_1,\ldots,\epsilon_n)^T$
- decompose $Y$ into $f(Y) = Y+\tau Z$ and $g(Y) = Y - \frac{1}{\tau} Z$ where $Z\sim N(0, \Sigma)$
- use $f(Y)$ to choose a basis $a_1,\ldots, a_p$ for the series expansion of $x_i$. Let $A$ denote the matrix of basis vectors for all $n$ data points
- use $g(Y)\mid f(Y)$ to construct pointwise or uniform CIs
the fitted coefficients are
\[\hat\beta(A) = \argmin_\beta \Vert g(Y) - A\beta\Vert^2 = (A^TA)^{-1}A^Tg(Y)\]
meanwhile, define the projected regression function
\[\beta^\star(A) = \argmin_\beta \bbE\left[ \Vert Y - A\beta\Vert^2 \right]\]
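a sketch of this last recipe (illustrative only: the basis is chosen by picking the largest jumps in $f(Y)$ rather than by actual trend filtering, and $\Sigma = \sigma^2 I$ is assumed known):

```python
import numpy as np
from scipy.stats import norm

# sketch: fission for nonparametric regression with a data-chosen basis.
# knots come from the largest jumps in f(Y) (a stand-in for trend filtering).
rng = np.random.default_rng(5)
n, sigma, tau, alpha, k = 300, 0.5, 1.0, 0.1, 4
x = np.linspace(0, 1, n)
f0 = np.where(x < 0.3, 0.0, np.where(x < 0.7, 2.0, 1.0))   # piecewise-constant truth
Y = f0 + rng.normal(0.0, sigma, n)

Z = rng.normal(0.0, sigma, n)
fY, gY = Y + tau * Z, Y - Z / tau

# choose knots from f(Y): positions of the k largest first differences
knots = np.sort(np.argsort(-np.abs(np.diff(fY)))[:k] + 1)
edges = np.concatenate(([0], knots, [n]))
A = np.zeros((n, len(edges) - 1))
for j in range(len(edges) - 1):
    A[edges[j]:edges[j + 1], j] = 1.0                        # piecewise-constant basis

# regress g(Y) on the chosen basis; pointwise CIs use Var(g_i) = sigma^2 (1 + 1/tau^2)
G = np.linalg.inv(A.T @ A)
beta_hat = G @ A.T @ gY
fit = A @ beta_hat
target = A @ (G @ A.T @ f0)                                  # projected regression function
se_fit = np.sqrt(np.sum(A * (A @ G), axis=1) * sigma**2 * (1 + 1 / tau**2))
half = norm.ppf(1 - alpha / 2) * se_fit
print("pointwise coverage of the projection:",
      np.mean((fit - half <= target) & (target <= fit + half)))
```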