WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Data Thinning for Convolution-Closed Distributions

Tags: Data Thinning

This note is for Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data Thinning for Convolution-Closed Distributions. Journal of Machine Learning Research, 25(57), 1–35.

Sample splitting cannot be applied when there is one parameter of interest per observation, or when the parameter of interest is a function of the n observations:

  • when estimating a low-rank approximation to a matrix, there is one parameter of interest (a latent variable coordinate) for each of the n rows in the matrix
  • in fixed-covariate regression under model misspecification, the target parameter depends on the specific n observations included in the data set
  • settings in which we wish to draw observation-specific inferences about each of the n observations

Data thinning is proposed as an alternative to sample splitting.

Outside of the following two distributions, no proposals are available for splitting a random variable into independent parts that follow the same distribution as the original random variable:

  • split $X \sim N(\mu, \sigma^2)$ with known $\sigma^2$ into two independent Gaussian random variables
  • split $X \sim \mathrm{Poisson}(\lambda)$ into two independent Poisson random variables
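As a rough illustration of these two splits, here is a minimal NumPy sketch. The function names and the fraction `eps` (the share of information sent to the first part) are my own notation, not taken from the note; the Gaussian split assumes $\sigma^2$ is known, and the Poisson split uses binomial thinning.

```python
import numpy as np

def thin_gaussian(x, sigma2, eps=0.5, rng=None):
    """Split x ~ N(mu, sigma2) into independent parts (x1, x2) with x1 + x2 = x.

    Marginally x1 ~ N(eps * mu, eps * sigma2) and
    x2 ~ N((1 - eps) * mu, (1 - eps) * sigma2). Requires the true sigma2.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # Conditional draw: x1 | x ~ N(eps * x, eps * (1 - eps) * sigma2)
    x1 = rng.normal(eps * x, np.sqrt(eps * (1 - eps) * sigma2))
    return x1, x - x1

def thin_poisson(x, eps=0.5, rng=None):
    """Split x ~ Poisson(lam) into independent parts (x1, x2) with x1 + x2 = x.

    Marginally x1 ~ Poisson(eps * lam) and x2 ~ Poisson((1 - eps) * lam).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # Conditional draw: x1 | x ~ Binomial(x, eps)
    x1 = rng.binomial(x, eps)
    return x1, x - x1
```

In both cases the two parts add back up to the original observation, so one part can be used for fitting and the other for validation.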

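Gamma decomposition into $M$ components (data thinning): suppose that $X \sim \mathrm{Gamma}(\alpha, \beta)$, where $\beta$ is unknown. Take $(X^{(1)}, \ldots, X^{(M)}) = XZ$, where $Z \sim \mathrm{Dirichlet}(\alpha/M, \ldots, \alpha/M)$ is drawn independently of $X$. Then $X^{(1)}, \ldots, X^{(M)}$ are mutually independent, they sum to $X$, and each is marginally drawn from a $\mathrm{Gamma}(\alpha/M, \beta)$ distribution.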
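The Gamma decomposition can be sketched in a few lines of NumPy. This is only an illustration: `thin_gamma` is a hypothetical helper name, `x` is assumed to be a 1-D array of i.i.d. Gamma draws with known shape `alpha`, and note that the rate $\beta$ never enters the code.

```python
import numpy as np

def thin_gamma(x, alpha, M, rng=None):
    """Split each x[i] ~ Gamma(alpha, beta) into M parts that sum to x[i].

    Returns an array of shape (len(x), M); column m holds X^(m).
    Only the shape alpha is needed; the rate beta may be unknown.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    # One Dirichlet(alpha/M, ..., alpha/M) weight vector per observation,
    # drawn independently of x
    z = rng.dirichlet(np.full(M, alpha / M), size=x.shape[0])
    return x[:, None] * z
```

Since each row of `z` sums to one, each row of the output sums back to the corresponding entry of `x`.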

  • Section 6: validating the results of clustering and low-rank matrix approximations

The Data Thinning Proposal

A review of convolution-closed distributions

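Let $F_\lambda$ denote a distribution indexed by a parameter $\lambda$ in parameter space $\Lambda$. Let $X' \sim F_{\lambda_1}$ and $X'' \sim F_{\lambda_2}$ with $X' \perp\!\!\!\perp X''$. If $X' + X'' \sim F_{\lambda_1 + \lambda_2}$ whenever $\lambda_1 + \lambda_2 \in \Lambda$, then $F_\lambda$ is convolution-closed in the parameter $\lambda$.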
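To make the definition concrete, the following sketch checks numerically that the Poisson family is convolution-closed in $\lambda$: the convolution of the Poisson($\lambda_1$) and Poisson($\lambda_2$) probability mass functions should match the Poisson($\lambda_1 + \lambda_2$) mass function. The helper `poisson_pmf`, the truncation point `K`, and the particular values of `lam1` and `lam2` are arbitrary choices for illustration.

```python
import math
import numpy as np

def poisson_pmf(lam, K):
    """Poisson(lam) probabilities at k = 0, 1, ..., K."""
    k = np.arange(K + 1)
    log_fact = np.array([math.lgamma(i + 1) for i in k])
    return np.exp(-lam + k * np.log(lam) - log_fact)

lam1, lam2, K = 1.5, 2.5, 30

# The pmf of X' + X'' is the discrete convolution of the two pmfs.
# Entries 0..K of the convolution only involve terms with index <= K,
# so they are exact despite the truncation.
conv = np.convolve(poisson_pmf(lam1, K), poisson_pmf(lam2, K))[: K + 1]
target = poisson_pmf(lam1 + lam2, K)

max_abs_diff = np.max(np.abs(conv - target))
print(max_abs_diff)  # expected to be at the level of floating-point error
```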

data thinning

effect of unknown nuisance parameters

Consider what happens when we perform data thinning on Gaussian data using an incorrect value of the variance.
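A small simulation can illustrate the effect. Assume $X \sim N(\mu, \sigma^2)$ is thinned with $X^{(1)} \mid X \sim N(\epsilon X, \epsilon(1-\epsilon)\tilde\sigma^2)$ and $X^{(2)} = X - X^{(1)}$, where $\tilde\sigma^2$ is the (possibly wrong) assumed variance. A short calculation then gives $\mathrm{Cov}(X^{(1)}, X^{(2)}) = \epsilon(1-\epsilon)(\sigma^2 - \tilde\sigma^2)$, so the two parts are independent only when $\tilde\sigma^2 = \sigma^2$. The function name and the numerical values below are my own choices for the sketch.

```python
import numpy as np

def thin_gaussian_assumed_var(x, sigma2_assumed, eps, rng):
    """Gaussian thinning that plugs in an assumed (possibly wrong) variance."""
    x1 = rng.normal(eps * x, np.sqrt(eps * (1 - eps) * sigma2_assumed))
    return x1, x - x1

rng = np.random.default_rng(1)
sigma2_true, sigma2_assumed, eps = 4.0, 1.0, 0.5

x = rng.normal(0.0, np.sqrt(sigma2_true), size=500_000)
x1, x2 = thin_gaussian_assumed_var(x, sigma2_assumed, eps, rng)

empirical_cov = np.cov(x1, x2)[0, 1]
# Covariance implied by the formula above: eps * (1 - eps) * (sigma2_true - sigma2_assumed)
predicted_cov = eps * (1 - eps) * (sigma2_true - sigma2_assumed)
print(empirical_cov, predicted_cov)
```

Underestimating the variance, as here, leaves the two parts positively correlated; overestimating it would make them negatively correlated.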

Multifold data thinning

Comparing data thinning and sample splitting

Theoretical comparison to sample splitting

