
Scaling Laws for Surrogate Data

Tags: Surrogate Data, Generative Models

This note is for Jain, A., Montanari, A., & Sasoglu, E. (2024, November 6). Scaling laws for learning with real and surrogate data. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

scaling laws for learning with real and surrogate data

  • collecting large quantities of high-quality data can be prohibitively expensive or impractical
  • one may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g., data collected under different circumstances or synthesized by generative models. they refer to such data as ‘surrogate data’
  • they study a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training
  • several findings
    • integrating surrogate data can significantly reduce the test error on the original distribution. surprisingly, this can happen even when the surrogate data is unrelated to the original data. they trace this behavior back to the classical Stein’s paradox
    • in order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM
    • the test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. this scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add

Introduction and overview

  • $n$ i.i.d. points $z_i$ from a target distribution $\cD$
  • given a family of parametric models governed by the parameter vector $\theta$, the goal is to find $\theta$ that minimizes the expected test loss $R_{test}(\theta)$, where the expectation is taken over $\cD$
  • surrogate data $Z^s = (z_i^s)_{i\le m}$ are assumed to be i.i.d. samples with distribution $\cD^s$
  • in general, they do not assume the distribution $\cD^s$ of the surrogate data to be close to the original data distribution $\cD$. However, they assume that these distributions are over the same domain.

a number of questions arise:

  1. how should one use the surrogate data in training?
  2. how many surrogate samples should one add to the original data?
  3. can one predict the improvement in test error achieved by adding surrogate samples to the training?

a natural approach would be to add the surrogate data to the original one in the usual training procedure. Namely, one attempts to minimize the overall empirical risk

\[\hat R_{n+m}^{naive}(\theta) = \sum_{i=1}^n \ell(\theta; z_i) + \sum_{i=1}^m \ell(\theta; z_i^s)\]

where $\ell(\theta; z)$ is the training loss function.

however, this approach has serious shortcomings.

consider a simple mean estimation problem, where $z_i\sim N(\theta_\star, I_d)$, $z_i^s \sim N(\theta_\star^s, I_d)$, $\ell(\theta; z) = \Vert \theta - z\Vert^2$, and $R_{test}(\theta) = \Vert \theta - \theta_\star\Vert^2$

a straightforward calculation yields that the test error of the empirical risk minimizer $\hat\theta_{n+m}^{\text{naive}}:=\argmin \hat R_{n+m}^{\text{naive}}(\theta)$ (i.e., the pooled sample mean of all $n+m$ points) is

\[R_{test}(\hat\theta_{n+m}^{\text{naive}}) = \left(\frac{m}{n+m}\right)^2\Vert \theta_\star^s-\theta_\star\Vert^2 + \frac{d}{n+m}\]

as $m$ increases, the variance (the second term) decreases, but the bias due to the difference $\Vert \theta_\star^s - \theta_\star\Vert$ increases, and the error approaches $\Vert \theta_\star^s - \theta_\star\Vert^2$, i.e., the model will be only as good as one trained on surrogate data alone
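
as a quick sanity check of this formula, here is a small Monte Carlo sketch (my own, not from the paper) comparing the simulated test error of the pooled mean with the bias-variance expression above:

```python
import numpy as np

# Monte Carlo check of the naive-ERM test error in the mean estimation toy:
# predicted value is (m/(n+m))^2 * ||theta_s - theta||^2 + d/(n+m).
rng = np.random.default_rng(0)
d, n, m = 5, 50, 200
theta_star = np.zeros(d)
theta_star_s = np.full(d, 0.5)      # surrogate mean, shifted by 0.5 per coordinate

errs = []
for _ in range(5000):
    z = theta_star + rng.normal(size=(n, d))
    zs = theta_star_s + rng.normal(size=(m, d))
    theta_naive = np.vstack([z, zs]).mean(axis=0)   # pooled mean = naive ERM
    errs.append(np.sum((theta_naive - theta_star) ** 2))

predicted = (m / (n + m)) ** 2 * np.sum((theta_star_s - theta_star) ** 2) + d / (n + m)
print("simulated:", np.mean(errs), "predicted:", predicted)
```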

in order to overcome these limitations, they study a weighted ERM approach, and show that the weight plays a crucial role. Namely, consider the following regularized empirical risk

\[\hat R_{n, m}(\theta; \alpha) = \frac{1-\alpha}{n} \sum_{i=1}^n\ell(\theta; z_i) + \frac{\alpha}{m}\sum_{i=1}^m \ell(\theta; z_i^s) + \Omega(\theta)\]

where $\alpha\in [0, 1]$ is the weight of the surrogate dataset and $\Omega: \IR^d \rightarrow \IR_{\ge 0}$ is a regularizer, e.g., a ridge $\Omega(\theta) = \lambda \Vert \theta\Vert_2^2$

denote by

\[\hat\theta_{n, m} (\alpha) := \argmin_{\theta} \hat R_{n, m}(\theta; \alpha)\]

they observe that the weighted ERM approach systematically achieves better test error than either training only on original data $(\alpha \rightarrow 0)$ or on surrogate data $(\alpha \rightarrow 1)$

further, the error at the optimal $\alpha$ is always monotone decreasing in both $m$ and $n$, and the approach outperforms the naive unweighted approach
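
to make the objective concrete, here is a minimal sketch (my own, assuming a linear model with squared loss and a ridge penalty; the paper's setup is not restricted to this choice) that minimizes $\hat R_{n,m}(\theta;\alpha)$ by plain gradient descent:

```python
import numpy as np

# Minimal sketch of the weighted ERM objective above, assuming a linear model
# with squared loss and ridge penalty Omega(theta) = lam * ||theta||^2,
# minimized by plain gradient descent. X, y: original data; Xs, ys: surrogate.
def weighted_erm(X, y, Xs, ys, alpha, lam=1e-3, lr=0.1, steps=2000):
    n, d = X.shape
    m = Xs.shape[0]
    theta = np.zeros(d)
    for _ in range(steps):
        g_orig = 2 * X.T @ (X @ theta - y) / n       # gradient of original empirical risk
        g_surr = 2 * Xs.T @ (Xs @ theta - ys) / m    # gradient of surrogate empirical risk
        g_reg = 2 * lam * theta                      # gradient of the ridge term
        theta -= lr * ((1 - alpha) * g_orig + alpha * g_surr + g_reg)
    return theta

# toy usage: surrogate responses come from a slightly perturbed coefficient vector
rng = np.random.default_rng(0)
d, n, m = 10, 50, 500
theta_star = rng.normal(size=d)
X, Xs = rng.normal(size=(n, d)), rng.normal(size=(m, d))
y = X @ theta_star + rng.normal(size=n)
ys = Xs @ (theta_star + 0.3 * rng.normal(size=d)) + rng.normal(size=m)
theta_half = weighted_erm(X, y, Xs, ys, alpha=0.5)
```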

summary of results

the mathematical results are developed in four different settings:

  1. the Gaussian sequence model
  2. a non-parametric function estimation setting
  3. low-dimensional empirical-risk minimization
  4. high-dimensional ridge regression

conclusions:

  1. surrogate data can reduce the test error, even if the surrogate data distribution is far from the original one
  2. tuning of $\alpha$. a nearly optimal $\alpha$ can be effectively selected by minimizing the error on a validation split of the original data
  3. scaling law. they propose a scaling law that captures the behavior of the test error with $n, m, \alpha$
  4. practical uses of the scaling law. it predicts how much the test error can be decreased by including any number $m$ of surrogate samples in the mix. one can further leverage the scaling law to achieve a desired error (a rough sketch of this workflow is given after the list) by
    • using the scaling law to determine the number of surrogate samples needed to achieve the desired performance
    • acquiring the surrogate samples and training the model using weighted ERM with the optimal weighting predicted by the scaling law.
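
a rough sketch of that workflow (purely illustrative: the power-law curve `a + b * m**(-c)` below is a generic placeholder, not the paper's scaling law, and the pilot error values are made up for the example): fit an error-vs-$m$ curve to a few pilot runs of weighted ERM, then invert it to estimate how many surrogate samples are needed for a target test error

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative only: a generic power law standing in for whatever scaling law
# one fits in practice; this is NOT the functional form derived in the paper.
def error_curve(m, a, b, c):
    return a + b * m ** (-c)

# hypothetical pilot measurements: test error of weighted ERM (tuned alpha)
# for a few surrogate-sample counts; these numbers are made up for the example
m_pilot = np.array([100.0, 300.0, 1000.0, 3000.0])
err_pilot = np.array([0.42, 0.31, 0.24, 0.20])

(a, b, c), _ = curve_fit(error_curve, m_pilot, err_pilot, p0=(0.1, 1.0, 0.5))

# invert the fitted curve to estimate the m needed for a target test error
target = 0.22
if target > a:
    m_needed = (b / (target - a)) ** (1.0 / c)
    print(f"estimated surrogate samples needed: {m_needed:.0f}")
else:
    print("target is below the fitted error floor a; not reachable this way")
```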

the use of surrogate data to enhance training has attracted increasing research effort

  • this line of work has largely focused on techniques to generate synthetic data that are well suited for training
    • generating data for computer vision tasks, ranging from object classification to semantic segmentation
    • in natural language processing
  • scaling laws have been broadly successful in guiding the development of large machine learning models
  • in data augmentation, the original samples are supplemented with transformed or noisy versions of the same. in contrast, here the surrogate data is assumed to come from a different source than the original one, and the surrogate samples are independent of the original samples
  • the problem is also studied within ‘domain adaptation’, a subarea of transfer learning. [BDBC+10] establishes bounds on the generalization error of weighted ERM via uniform convergence. however, these bounds do not reveal the full advantage achieved by this approach and are not precise enough to justify the scaling laws
    • recent work in domain adaptation studies the behavior of the test error and its scaling laws, but only considers vanilla ERM, a special case of weighted ERM

Regularization, Gaussian mean estimation, Stein paradox

the role of the parameter $\alpha$ can be understood by considering the limit $m\rightarrow\infty$

\[\hat R_{n,\infty}(\theta; \alpha) = \frac{1-\alpha}{n}\sum_{i=1}^n\ell(\theta; z_i) + \alpha R^s(\theta) + \Omega(\theta)\]

where $R^s(\theta) := \mathbb{E}_{z\sim \cD^s}[\ell(\theta; z)]$ is the population risk for the surrogate data, i.e., the law-of-large-numbers limit of the surrogate empirical risk. This suggests thinking of the surrogate data as an additional (highly non-trivial) regularizer, with parameter $\alpha$.

this leads to a simple yet important insight: adding surrogate data to the original data is beneficial if $\alpha$ is chosen optimally, and large $m$ reduces statistical fluctuations in this regularizer

this contrasts with the unweighted approach whose test error in general deteriorates for large $m$

as a toy example, reconsider the mean estimation problem:

\[z_i \sim N(\theta_\star, I_d), \; z_i^s \sim N(\theta_\star^s, I_d), \; \ell(\theta; z) = \Vert \theta - z\Vert^2\]

and

\[R_{test}(\theta) = \Vert \theta - \theta_\star\Vert^2\]

then we have

\[\hat \theta_{n, m}(\alpha) = (1 - \alpha)\sum_{i\le n}z_i/n + \alpha \sum_{i\le m} z_i^s/m\]

in other words, weighted ERM shrinks the empirical mean of the original data towards the empirical mean of the surrogate data

with a suitably chosen $\alpha$, weighted ERM always achieves better test error than training only on the original data, regardless of the distance between the original and surrogate data
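
to see why, a short bias-variance calculation (spelled out here for completeness; it is not written out in the note above) gives, averaging over the data as in the naive-ERM formula,

\[R_{test}(\hat\theta_{n,m}(\alpha)) = \alpha^2\Vert\theta_\star^s - \theta_\star\Vert^2 + (1-\alpha)^2\frac{d}{n} + \alpha^2\frac{d}{m}\]

which is quadratic in $\alpha$ and minimized at

\[\alpha_\star = \frac{d/n}{d/n + d/m + \Vert\theta_\star^s - \theta_\star\Vert^2}, \qquad R_{test}(\hat\theta_{n,m}(\alpha_\star)) = \frac{d}{n}\cdot \frac{d/m + \Vert\theta_\star^s - \theta_\star\Vert^2}{d/n + d/m + \Vert\theta_\star^s - \theta_\star\Vert^2} < \frac{d}{n}\]

so the optimally weighted estimator is strictly better than using the original data alone (whose risk is $d/n$), however large $\Vert\theta_\star^s - \theta_\star\Vert$ is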

this is a disguised version of the celebrated Stein paradox: in estimating a Gaussian mean, a procedure that shrinks the empirical mean towards an arbitrary point by a carefully chosen amount outperforms the naive empirical mean.

in the toy example, the naive empirical mean corresponds to estimation based purely on the original data, and here one shrinks towards the mean of the surrogate data. Of course, the improvement over the empirical mean is only possible if $\alpha$ is chosen optimally.

Stein’s analysis implies that in the Gaussian mean problem, $\alpha$ can be chosen empirically as long as the dimension of $\theta$ is $d \ge 3$.
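
for reference, the classical James-Stein choice in this problem (written here for the $m\to\infty$ limit, where the surrogate mean $\theta_\star^s$ is effectively known; the formula is standard but not spelled out in the note above) is

\[\hat\theta^{JS} = (1-\hat\alpha)\,\bar z + \hat\alpha\,\theta_\star^s, \qquad \hat\alpha = \frac{(d-2)/n}{\Vert \bar z - \theta_\star^s\Vert^2}, \qquad \bar z = \frac{1}{n}\sum_{i\le n} z_i\]

which dominates the plain empirical mean $\bar z$ whenever $d\ge 3$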

in the settings here, $\alpha$ can be chosen via cross-validation.
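
a minimal sketch of this selection for the mean estimation toy (my own illustration, not the paper's code; with a small validation split the choice of $\alpha$ is of course noisy):

```python
import numpy as np

# Toy illustration: choose alpha by minimizing the error on a validation split
# of the original data, then refit with the selected alpha on all original data.
rng = np.random.default_rng(1)
d, n, m = 20, 100, 2000
theta_star = np.zeros(d)
theta_star_s = theta_star + 0.2          # surrogate mean, mildly shifted
z = theta_star + rng.normal(size=(n, d))
zs = theta_star_s + rng.normal(size=(m, d))

n_val = n // 4
z_val, z_train = z[:n_val], z[n_val:]

def theta_hat(alpha, z_real, z_surr):
    # closed-form weighted ERM estimate for the squared loss (see above)
    return (1 - alpha) * z_real.mean(axis=0) + alpha * z_surr.mean(axis=0)

alphas = np.linspace(0.0, 1.0, 21)
val_err = [np.mean((z_val - theta_hat(a, z_train, zs)) ** 2) for a in alphas]
alpha_cv = alphas[int(np.argmin(val_err))]

theta_cv = theta_hat(alpha_cv, z, zs)    # refit on all original data
print("selected alpha:", alpha_cv)
print("test error (selected alpha):", np.sum((theta_cv - theta_star) ** 2))
print("test error (alpha = 0):     ", np.sum((z.mean(axis=0) - theta_star) ** 2))
```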

