Linear Regression with Partially Shuffled Data
This post is based on Slawski, M., Diao, G., & Ben-David, E. (2019). A Pseudo-Likelihood Approach to Linear Regression with Partially Shuffled Data. arXiv:1910.01623 [cs, stat].
There has been significant interest in linear regression when predictors and responses are not observed in matching pairs corresponding to the same statistical unit, a situation that arises from separate data collection and uncertainty in data integration. Mismatched pairs can considerably impair the model fit and disrupt the estimation of regression parameters.
The paper presents a method to adjust for such mismatches under “partial shuffling”, in which a sufficiently large fraction of (predictor, response)-pairs are observed in their correct correspondence.
- the method is based on a pseudo-likelihood in which each term takes the form of a two-component mixture density.
- EM schemes are proposed for optimization
- the schemes scale favorably in the number of samples
- they achieve excellent statistical performance relative to an oracle that has access to the correct pairings, as certified by simulations and case studies.
- the proposed approach can tolerate a larger fraction of mismatches than existing approaches, and enables estimation of the noise level as well as the fraction of mismatches.
- inference for the resulting estimator can be based on established theory for composite likelihood estimation.
- propose a statistical test for the presence of mismatches and establish its consistency under suitable conditions.
Introduction
Broken Sample: assume that $\{(\x_{\pi^*(i)},y_i)\}_{i=1}^n$ are i.i.d. observations from some joint distribution $P_{\x, y;\theta^*}$, where $\theta^*$ is an unknown parameter and $\pi^*$ is an unknown permutation of $\{1,\ldots,n\}$.
In the context of linear regression, the following model has been studied under the names “Unlabeled Sensing”, “Regression with unknown permutation”, and “Regression with shuffled data”:
\[\begin{equation} \begin{aligned} y_i &= \x_{\pi^*(i)}^T\beta^* + \sigma_*\varepsilon_i\,, \quad i=1,\ldots,n\,,\\ \y &= \Pi^*\X\beta^* + \sigma_*\bepsilon\,, \end{aligned} \label{model} \end{equation}\]where $\y=(y_i)_{i=1}^n$, $\Pi^*=(I(\pi^*(i)=j))_{i,j=1}^n$, $\X^T=[\x_1,\ldots,\x_n]$, and $\bepsilon=(\varepsilon_i)_{i=1}^n$.
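For concreteness, here is a minimal simulation of data from \eqref{model} under a sparse permutation that moves at most $k$ indices. This is my own illustrative sketch; the Gaussian design and the function name `simulate_shuffled` are assumptions, not taken from the paper.

```python
import numpy as np

def simulate_shuffled(n, d, k, sigma, rng=None):
    """Draw (X, y) from y = Pi* X beta* + sigma * eps,
    where pi* differs from the identity in at most k positions."""
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, d))        # Gaussian design
    beta = rng.standard_normal(d)
    pi = np.arange(n)
    mism = rng.choice(n, size=k, replace=False)
    pi[mism] = rng.permutation(mism)       # shuffle only k entries
    y = X[pi] @ beta + sigma * rng.standard_normal(n)
    return X, y, beta, pi
```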
Related work
Several recent papers have studied estimation of $\beta^*$ and/or $\pi^*$ under model \eqref{model}, predominantly under a random Gaussian design with $\{\x_i\}_{i=1}^n\sim_{\text{i.i.d.}}N(0,I_d)$ and Gaussian errors.
Some researchers show that in the noiseless case $(\sigma_*=0)$, $\beta^*$ can be uniquely recovered by exhaustive search over permutations if $n > 2d$.
In the noisy case, a series of properties have been established for the least squares problem
\[\begin{equation} \min_{\Pi\in \cP_n,\beta\in\IR^d}\Vert \y-\Pi\X\beta\Vert_2^2\label{eq:2} \end{equation}\]where $\cP_n$ denotes the set of $n$-by-$n$ permutation matrices.
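Since there are $n!$ candidate permutations, \eqref{eq:2} can be solved exactly by enumeration only for very small $n$. The following brute-force sketch (my own illustration, not an algorithm from the paper) makes the objective concrete:

```python
import itertools
import numpy as np

def brute_force_lsq(X, y):
    """Exact solution of eq. (2) by enumerating all n! permutations;
    feasible only for very small n."""
    n = len(y)
    best_rss, best_pi, best_beta = np.inf, None, None
    for pi in itertools.permutations(range(n)):
        Xp = X[list(pi)]                   # rows x_{pi(i)}, i.e. Pi X
        beta, *_ = np.linalg.lstsq(Xp, y, rcond=None)
        rss = np.sum((y - Xp @ beta) ** 2)
        if rss < best_rss:
            best_rss, best_pi, best_beta = rss, np.array(pi), beta
    return best_pi, best_beta, best_rss
```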
- someone shows that \eqref{eq:2} is NP-hard
- someone derives necessary and sufficient conditions for exact and approximate recovery of $\Pi^*$ based on \eqref{eq:2}, and elaborates on the significance of the signal-to-noise ratio (SNR) $\Vert \beta^*\Vert_2^2/\sigma^2$ in the context.
- an excessively large SNR of the order $n^2$ is proved to be a necessary condition for approximate permutation recovery for any estimator.
- someone shows that the SNR needs to be at least of the order $d/\log\log n$ to ensure approximate recovery of $\beta^*$.
- problem \eqref{eq:2} can be shown to suffer from overfitting due to the extra freedom in optimizing $\Pi$.
algorithms for \eqref{eq:2}:
- a particular scheme has polynomial time complexity, but is “not meant for practical deployment”
- the convex relaxation of \eqref{eq:2} in which $\cP_n$ is replaced by its convex hull, the set of doubly stochastic matrices, was observed to yield poor performance.
- alternating minimization, as well as the EM algorithm (combined with MCMC sampling) in which $\Pi^*$ constitutes missing data; a sketch of the former appears after this list.
- branch-and-bound scheme with promising empirical performance on small datasets.
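A minimal sketch of the alternating-minimization idea follows (again my own illustration). The key observation for the $\Pi$-step is the rearrangement inequality: for fixed $\beta$, the permutation minimizing $\Vert \y-\Pi\X\beta\Vert_2^2$ matches the order statistics of $\y$ with those of $\X\beta$.

```python
import numpy as np

def alt_min(X, y, n_iters=50):
    """Alternating minimization for eq. (2): OLS for beta given the
    permutation, then re-matching by sorting for the permutation given beta."""
    n, _ = X.shape
    pi = np.arange(n)                      # start at the identity
    for _ in range(n_iters):
        # beta-step: OLS on the currently matched pairs (y_i, x_{pi(i)})
        beta, *_ = np.linalg.lstsq(X[pi], y, rcond=None)
        # Pi-step: the k-th smallest y is matched with the k-th smallest x'beta
        z = X @ beta
        new_pi = np.empty(n, dtype=int)
        new_pi[np.argsort(y)] = np.argsort(z)
        if np.array_equal(new_pi, pi):
            break                          # fixed point reached
        pi = new_pi
    return beta, pi
```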
$k$-sparse: a simplified setting of \eqref{model} in which $\pi^*(i)=i$ except for at most $k \ll n$ elements of $\{1,\ldots,n\}$.
- under this restriction on $\pi^*$, the constrained least squares estimator corresponding to \eqref{eq:2} has desirable statistical properties if the fraction $k/n$ is not too large.
- a convex relaxation of that constrained least squares problem yields an estimator of $\beta^*$ that is consistent under suitable conditions on $k/n$; the permutation can subsequently be estimated by sorting (a sketch of the relaxation follows below).
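One common convex surrogate of this kind (my reading of the general technique; the exact formulation in the cited work may differ) models the $k$ mismatches as a sparse additive corruption $\xi$ and solves $\min_{\beta,\xi} \tfrac{1}{2}\Vert \y-\X\beta-\xi\Vert_2^2 + \lambda\Vert\xi\Vert_1$ by block coordinate descent:

```python
import numpy as np

def sparse_corruption_lsq(X, y, lam, n_iters=200):
    """Block coordinate descent for
        min_{beta, xi} 0.5 * ||y - X beta - xi||_2^2 + lam * ||xi||_1,
    where a nonzero xi_i flags observation i as mismatched."""
    n, d = X.shape
    xi = np.zeros(n)
    for _ in range(n_iters):
        # beta-step: OLS on the corruption-adjusted responses
        beta, *_ = np.linalg.lstsq(X, y - xi, rcond=None)
        # xi-step: soft-thresholding of the residuals
        r = y - X @ beta
        xi = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)
    return beta, xi
```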
some papers extend this to the multivariate regression setup in which each $y_i$, $i=1,\ldots,n$, has dimension $m \ge 1$.
- the permutation recovery can succeed without stringent conditions on the SNR once $m$ is at least of the order $\log n$.
some paper complements this result with matching information-theoretic lower bounds
some paper considers “spherical regression with mismatch corruption”, with responses and predictors contained in the unit sphere of $\IR^m$.
in addition to sparsity:
- some paper additionally assumes $\Pi^*$ to have a block structure
- $\Pi^*$ is not required to be a permutation, allowing a slightly more general class of mismatches.
Instead of linear regression, some papers study isotonic regression, i.e.,
\[y_i = f^*(x_{\pi^*(i)}) + \sigma_*\varepsilon_i\,,\]with $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ being one-dimensional, and a non-decreasing regression function $f^*$.
Contributions
The paper builds on the setup of sparse permutations and improves on it in two important aspects.
Two downsides of the prior approach:
- mismatches are treated as generic data contamination, which leads to a substantial loss of information; performance degrades severely as the fraction of mismatches $k/n$ increases.
- it depends on a tuning parameter involving the noise level $\sigma_*$, which is generally neither known nor easy to estimate, whereas the approach proposed herein is in principle tuning-free.
the proposed approach is far less affected as the mismatch fraction $\alpha_*$ increases, and empirically performs close to the oracle least squares estimator equipped with knowledge of $\pi^*$.
the proposed approach also avoids the quadratic runtime in $n$ incurred by alternatives such as the Lahiri-Larsen estimator.
Optimization is based on a pseudolikelihood having the form of a likelihood for fitting a two-component mixture model, with one component corresponding to a regular linear regression model without mismatches and the second component accounting for extra variation due to mismatches.
- despite the nonconvexity of the resulting optimization problem, reasonable approximate solutions can be obtained via the EM algorithm (a sketch follows below).
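A minimal EM sketch in this spirit, under the simplifying assumption that the mismatch component is a fixed broad Gaussian fitted to the marginal distribution of $\y$ (the paper’s actual mixture component and update rules differ; the function name and initialization are my own):

```python
import numpy as np
from scipy.stats import norm

def em_mixture_fit(X, y, n_iters=200, tol=1e-8):
    """EM for the two-component mixture pseudolikelihood
        alpha * N(y_i | x_i' beta, sigma^2) + (1 - alpha) * g(y_i),
    where g is a fixed background density modeling mismatched pairs
    (here a Gaussian fitted to the marginal of y; a simplification)."""
    mu_y, s_y = y.mean(), y.std()                   # background component
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS warm start
    sigma = (y - X @ beta).std()
    alpha = 0.9                                     # guess: 90% correct matches
    prev = -np.inf
    for _ in range(n_iters):
        f1 = norm.pdf(y, loc=X @ beta, scale=sigma)  # matched density
        f0 = norm.pdf(y, loc=mu_y, scale=s_y)        # mismatch density
        mix = alpha * f1 + (1 - alpha) * f0
        ll = np.log(mix).sum()
        if ll - prev < tol:
            break
        prev = ll
        # E-step: responsibility of the "correctly matched" component
        w = alpha * f1 / mix
        # M-step: mixing weight, weighted least squares, weighted scale
        alpha = w.mean()
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        sigma = np.sqrt(np.sum(w * (y - X @ beta) ** 2) / w.sum())
    return beta, sigma, alpha
```

Combined with `simulate_shuffled` from the introduction, the estimate $1-\hat\alpha$ should roughly track the simulated mismatch fraction $k/n$.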
It obtains asymptotic standard errors and confidence intervals for $(\beta^*, \sigma_*,\alpha_*)$.
It also proposes a test for the null hypothesis $H_0:\Pi^*=I_n$, i.e., a test for the presence of mismatches, and shows its consistency under suitable conditions.
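The paper’s test statistic is not reproduced here. As a generic stand-in, a parametric bootstrap built on `em_mixture_fit` above can calibrate the estimated mismatch fraction under $H_0$ (entirely my own sketch, not the paper’s procedure):

```python
import numpy as np

def mismatch_test(X, y, n_boot=200, seed=0):
    """Parametric-bootstrap test of H0: Pi* = I_n (no mismatches),
    using 1 - alpha_hat from em_mixture_fit as the test statistic.
    A generic stand-in, NOT the test proposed in the paper."""
    rng = np.random.default_rng(seed)
    _, _, alpha_hat = em_mixture_fit(X, y)
    stat = 1.0 - alpha_hat
    # fit the null model (plain OLS) and simulate from it
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma0 = (y - X @ beta0).std()
    null_stats = np.empty(n_boot)
    for b in range(n_boot):
        yb = X @ beta0 + sigma0 * rng.standard_normal(len(y))
        _, _, ab = em_mixture_fit(X, yb)
        null_stats[b] = 1.0 - ab
    pval = (1 + np.sum(null_stats >= stat)) / (n_boot + 1)
    return stat, pval
```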