# Linear Regression with Partially Shuffled Data

##### Posted on Oct 08, 2019

There has been significant interest in linear regression in the situation where predictors and responses are not observed in matching pairs corresponding to the same statistical unit as a consequence of separate data collection and uncertainty in data integration. Mismatched pairs can considerably impact the model fit and disrupt the estimation of regression parameters.

The paper: present a method to adjust for such mismatches under “partial shuffling” where a sufficiently large fraction of (predictors, response)-pairs are observed in their correct correspondence.

• the method based on a pseudo-likelihood where each term takes the form of a two-component mixture density.
• EM schemes are proposed for optimization
• scale favorably in the number of samples
• achieve excellent statistical performance relative to an oracle that has access to the correct pairings as certified by simulations and case studies.
• the proposed approach can tolerated larger fraction of mismatches than existing approaches, and enables estimation of the noise level as well as the fraction of mismatches.
• inference for the resulting estimator can be based on established theory for composite likelihood estimation.
• propose a statistical test for the presence of mismatches and establish its consistency under suitable conditions.

## Introduction

Broken Sample: assume that $\{(\x_{\pi^*(i)},y_i)\}_{i=1}^n$ are i.i.d. observations from some joint distribution $P_{\x, y;\theta^*}$, where $\theta^*$ is an unknown parameter and $\pi^*$ is an unknown permutation of $\{1,\ldots,n\}$.

In the context of linear regression, the following model has been studied under the terms “Unlabeled Sensing”, “Regression with unknown permutation” and “Regression with shuffled data”.

\begin{equation} \begin{aligned} y_i &= \x_{\pi^*(i)}^T\beta^\* + \sigma_\*\varepsilon_i\,, i=1,\ldots,n\,,\\ \y &= \Pi^*\X\beta^* +\sigma\bepsilon, \y=(y_i)_{i=1}^n, \Pi^*=(I(\pi^*(i)=j))_{i,j=1}^n,\X^T=[\x_1,\ldots,\x_n],\bepsilon=(\varepsilon_i)_{i=1}^n\,. \end{aligned} \label{model} \end{equation}

several recent papers have studied estimation of $\beta^*$ and/or $\pi^*$ under model \eqref{model}, predominantly under random Gaussian design with $\{\x_i\}_{i=1}^n\sim_{\text{i.i.d.}}N(0,I_d)$ and Gaussian errors.

Some researchers show that in the noiseless case $(\sigma_*=0)$, $\beta^*$ can be uniquely recovered by exhaustive search over permutations if $n > 2d$.

In the noisy case, a series of properties have been established for the least squares problem

$\begin{equation} \min_{\Pi\in \cP_n,\beta\in\IR^d}\Vert \y-\Pi\X\beta\Vert_2^2\label{eq:2} \end{equation}$

where $\cP_n$ denotes the set of $n$-by-$n$ permutation matrices.

• someone shows that \eqref{eq:2} is NP-hard
• someone derives necessary and sufficient conditions for exact and approximate recovery of $\Pi^*$ based on \eqref{eq:2}, and elaborates on the significance of the signal-to-noise ratio (SNR) $\Vert \beta^*\Vert_2^2/\sigma^2$ in the context.
• an excessively large SNR of the order $n^2$ is proved to be a necessary condition for approximate permutation recovery for any estimator.
• someone shows that the SNR needs to be at least of the order $d/\log\log n$ to ensure approximate recovery of $\beta^*$.
• problem \eqref{eq:2} can be shown to suffer from overfitting due to the extra freedom in optimizing $\Pi$.

algorithms for \eqref{eq:2}:

• a particular scheme has polynomial time complexity, but is “not meant for practical deployment”
• the convex relaxation of \eqref{eq:2} in which $\cP_n$ is replaced by its convex hull, the set of doubly stochastic matrices, was observed to yield poor performance.
• alternating minimization as well as the use of the EM algorithm (combined with MCMC sampling) where $\Pi^*$ constitutes missing data.
• branch-and-bound scheme with promising empirical performance on small datasets.

$k$-sparse: a simplified setting of \eqref{model}, where $\pi^*(i)=i$ except for at most $k « n$ elements of $\{1,\ldots,n\}$,

• under this restriction on $\pi^*$, the constrained least squares estimator corresponding to \eqref{eq:2} jas desirable statistical properties if the fraction $k/n$ is not too large.
• a convex relaxation of that constrained least squares problem yields an estimator $\beta^*$ that is consistent under suitable conditions on $k/n$; the permutation can be estimated subsequently by sorting.

some papers extend it to multivariate regression setup where $\{y_i\}_{i=1}^n$ have dimension $m \ge 1$ each.

• the permutation recovery can succeed without stringent conditions on the SNR once $m$ is at least of the order $\log n$.

some paper complements this result with matching information-theoretic lower bounds

some paper considers “spherical regression with mismatch corruption” with responses and predictors being contained in the unit sphere of in $\IR^m$.

in addition to sparsity:

• some paper additionally assumes $\Pi^*$ to have a block structure
• $\Pi^*$ is not required to be a permutation to allow a slightly more general class of mismatches.

Instead of linear regression, some papers study isotonic regression, i.e.,

$y_i = f^*(x_{\pi^*}(i)) + \sigma_*\varepsilon_i\,,$

with $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ being one-dimensional, and a non-decreasing regression function $f^*$.

### Contributions

build on the setup of sparse permutations, and improvements with regard to two important aspects.

two downsides of the other paper:

• mismatches are treated as generic data contamination, which leads to a substantial loss of information. performances degrades severely as the fraction of mismatches $k/n$ increases.
• its dependence on a tuning parameter involving the noise level $\sigma_*$, which is generally not known nor easy to estimate. But the approach proposed herein is in principle tuning-free

the approach is far less affected as $\alpha_*$ increases and empirically performs rather closely to the oracle least squares estimator equipped with knowledge of $\pi^*$

the proposed approach also avoids a quadratic runtime in $n$ that is incurred for alternatives such as the Lahiri-Larsen estimator.

Optimization is based on a pseudolikelihood having the form of a likelihood for fitting a two-component mixture model, with one component corresponding to a regular linear regression model without mismatches and the second component accounting for extra variation due to mismatches.

• despite the nonconvexity of the resulting optimization problem, reasonable approximate solutions can be obtained via the EM algorithm

Obtain asymptotic standard errors and confidence intervals for $(\beta^*, \sigma_*,\alpha_*)$.

Propose a test for the null hypothesis $H_0:\Pi^*=I_n$, i.e., a test for the presence of mismatches, and show its consistency under suitable conditions.

Published in categories Memo