Boosting Data Analytics with Synthetic Data
synthetic data generation, a cornerstone of Generative AI, promotes a paradigm shift in data science by addressing data scarcity and privacy while enabling unprecedented performance.
the paper:
- explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data
- presents the Synthetic Data Generation for Analytics (Syn) framework, which applies statistical approaches to high-quality synthetic data produced by generative models such as tabular diffusion models
a key finding within this framework: the generation effect
- the error rate of a statistical method on synthetic data decreases as more synthetic data is added, but may eventually rise or stabilize
- this phenomenon, stemming from the challenge of accurately mirroring the raw data distribution, highlights a “reflection point”: an ideal volume of synthetic data defined by specific error metrics
through three case studies:
- sentiment analysis
- predictive modeling of structured data
- inference in tabular data
the paper validates the superior performance of this framework compared to conventional approaches
On privacy, synthetic data poses lower risks while supporting the differential privacy standard. These studies underscore synthetic data’s untapped potential in redefining the landscape of data science
Introduction
the paper investigates the challenges of efficacy and privacy in synthetic data utilization, emphasizing the role of synthetic data in enhancing data analytics
introduce the synthetic data generation for analytics (Syn) framework, aimed at increasing the precision of statistical methods applied to high-fidelity synthetic data that accurately replicates raw data through transfer learning.
this framework mirrors the statistical properties of raw data through advanced knowledge transfer techniques while offering a potential solution for data sharing without compromising data privacy.
synthetic data provides two main benefits for data analytics:
- it mitigates data scarcity
- addresses privacy issues
in the Syn framework, a generative model, informed by raw data, is used to produce synthetic data that replicates its distribution, incorporating insights from pre-trained models in relevant studies through transfer learning.
the framework enables the integration of both pre-trained and generative models of various kinds that share comparable latent structures
it employs an array of generative models tailored for different domains
- image diffusion
- text diffusion
- text-to-image diffusion models
- tabular diffusion models
Syn also embraces advanced models such as reversible generative models
these flow-based models capture the raw data distribution and can estimate both conditional and marginal distributions
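a minimal sketch of why flows admit such estimates: an invertible map plus the change-of-variables formula yields an exact log-density. The one-dimensional affine transform and standard-normal base below are illustrative stand-ins for a trained flow:

```python
import numpy as np
from scipy.stats import norm

# Change of variables: if z = f(x) is invertible with base density p_Z,
# then log p_X(x) = log p_Z(f(x)) + log |det df/dx|.
# Illustrative one-layer affine flow f(x) = (x - mu) / sigma; real flows
# stack many learned invertible layers.
def flow_log_density(x, mu=0.5, sigma=2.0):
    z = (x - mu) / sigma              # forward pass through the flow
    log_det = -np.log(sigma)          # log |det Jacobian| of the affine map
    return norm.logpdf(z) + log_det   # exact log-density of x under the flow

print(flow_log_density(np.array([0.0, 1.0, 2.0])))
```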
a critical issue of efficacy: can high-fidelity synthetic data boost the effectiveness of statistical methods solely reliant on raw data?
- if so, how may one implement such enhancements?
- recent research offers diverging viewpoints on this issue
- synthetic X-ray images improve the accuracy of machine learning models (Gao et al., 2023)
- training on synthetic data may compromise performance for some machine learning models (Kotelnikov et al., 2023)
the investigation reveals that
- synthetic data, when accurately generated, can boost the accuracy of a statistical method by augmenting the sample size of raw data
- however, a statistical method applied to low-fidelity synthetic data could yield unreliable outcomes
a generation effect: as the size of synthetic data grows, the precision gain of a method may diminish
- this is due to generation errors: discrepancies between the data-generating distributions of synthetic and raw data
- the generation effect underscores a key concern: regardless of the size of synthetic data, generation errors can compromise the accuracy of a statistical method
(garbage in, garbage out)
challenge: while evaluating a supervised task is straightforward, hypothesis testing presents the challenge of controlling the Type-I error while enhancing the power of a test
Syn: Syn-Test, Syn-Slm
(reminds me of another talk: resampling arose for historical reasons, and this is now another scene for resampling)
Syn-Test:
- determine the ideal size of synthetic data required to control the empirical Type-I error while performing a test with finite inference samples, using Monte Carlo methods
- the ideal size of synthetic data can heighten a test’s accuracy
- their theoretical investigation sheds light on the generation effect, precision, and the size of synthetic data
Syn-Slm:
- a streamlined approach enhancing Syn’s usability in various applications
- this method bypasses the training of conventional predictive models in supervised learning tasks. Instead, it directly models the outcome’s conditional distribution through advanced generative models
- the effectiveness of Syn-Slm is demonstrated with two examples: sentiment analysis using text data and regression analysis with tabular data
three areas:
- sentiment analysis
- OpenAI’s GPT-3.5 for Syn-Slm
- DistilBERT using transfer learning
- LSTM in the traditional framework for analyzing consumer reviews from the IMDB movie dataset
- predictive modeling
- Syn-Boost, a version of CatBoost (a gradient-boosting algorithm) trained on synthetic data
- tabular data inference
- explore feature significance tests in discerning feature relevance in regression and classification using CatBoost
- Dai, Shen, and Pan (2024): an asymptotic test through sample splitting
Enhancing Statistical Accuracy
2.1 Synthetic Data
apply statistical methods to a synthetic sample $\tilde Z^{(m)} = (\tilde Z_i)_{i=1}^m$, generated by a generative model trained on raw data $Z^{(n)} = (Z_i)_{i=1}^n$ through fine-tuning a pre-trained model
these models include
- GPT
- diffusion models
- normalizing flows
- GANs
In this framework, the CDF $\tilde F$ of $\tilde Z^{(m)}$ estimates the CDF $F$ of $Z^{(n)}$.
To yield $\tilde F$ directly, one can utilize reversible generative models such as normalizing flows and Roundtrip GAN, which act as nonparametric estimates of $F$.
For other generative models, such as diffusion models and GPT, one can typically obtain $\tilde F$ from synthetic data employing Monte Carlo methods
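a minimal sketch of the Monte Carlo route in one dimension: draw a large synthetic sample and take its empirical CDF as $\tilde F$ (the Gaussian sampler below stands in for a trained generative model):

```python
import numpy as np

rng = np.random.default_rng(0)
synthetic = rng.normal(size=100_000)   # stand-in for draws from a trained generative model

def ecdf(sample):
    """Empirical CDF: tilde F(t) = (1/m) * #{i : Z_i <= t}."""
    s = np.sort(sample)
    return lambda t: np.searchsorted(s, t, side="right") / len(s)

F_tilde = ecdf(synthetic)
print(F_tilde(0.0))   # roughly 0.5 for this centered sampler
```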
in the numerical examples, the paper uses a tabular diffusion model (TDM, Kotelnikov et al., 2023) and GPT for synthetic data generation
2.2 Optimal Synthetic Size for Estimation and Prediction
the numerical insights reveal the existence of a reflection point, denoted as
\[m_0 = \operatorname*{argmin}_{m\ge 1} R(\hat\theta(\tilde Z^{(m)})),\]
which delineates the relationship between the synthetic sample size $m$ and the accuracy gain of the method.
this point $m_0$ is governed by the generation error, measured by metrics such as the total variation distance between $\tilde F$ and $F$:
\[\mathrm{TV}(\tilde F, F) = \sup_B\vert P_F(B) - P_{\tilde F}(B)\vert\,.\]
to estimate $m_0$, optimize the method’s empirical risk across $m$ on an independent cross-validation sample from the original data, which yields an optimizer $\hat m$ as an estimate of $m_0$; see the sketch below
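a hedged sketch of this tuning loop, with the generator, fitting routine, and risk measure as placeholders; the toy usage injects a small bias into the generator to mimic generation error, so the risk curve eventually turns upward:

```python
import numpy as np

def estimate_reflection_point(generate, fit, risk, cv_data, grid):
    """hat m = argmin over the grid of the empirical risk on a CV sample.

    generate(m)      -> synthetic sample of size m (trained generative model assumed)
    fit(sample)      -> fitted statistical method theta_hat
    risk(theta, cv)  -> empirical risk of theta on the cross-validation sample
    """
    risks = [risk(fit(generate(m)), cv_data) for m in grid]
    return grid[int(np.argmin(risks))], risks

# toy usage: method = sample mean, risk = squared error against the CV mean;
# the generator's slight bias (0.05) plays the role of generation error
rng = np.random.default_rng(1)
cv = rng.normal(0.0, 1.0, size=500)
m_hat, curve = estimate_reflection_point(
    generate=lambda m: rng.normal(0.05, 1.0, size=m),
    fit=lambda s: s.mean(),
    risk=lambda theta, data: (theta - data.mean()) ** 2,
    cv_data=cv,
    grid=[100, 1_000, 10_000, 100_000],
)
print(m_hat, curve)
```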
2.3 Optimal Synthetic Size for Hypothesis Testing
Syn-Test yields two distinct advantages.
- it employs synthetic data to gauge the null distribution of any test statistic by Monte Carlo methods as in the bootstrap approach
- identifies the optimal synthetic data size, optimizing a test’s power while maintaining a suitable control of Type-I errors
Given a raw sample, Syn-Test employs two nearly equal-sized subsamples $S_1$ and $S_2$, partitioned from a training sample for fine-tuning a pre-trained generative model. It also uses a separate validation sample $S_3$ for validating model training.
- one generative model generates synthetic data using $S_1$ for null distribution estimation
- the other uses $S_2$ for computing the test statistic
Syn-Test also empirically determines the optimal synthetic data size $m_0$ to control the Type-I error.
By swapping the roles of $S_1$ and $S_2$, Syn-Test can de-randomize the partition, transitioning the original inference sample size $n$ to a synthetic inference sample size of $m$
Syn-Test encompasses four steps, using a significance level $\alpha$, a tolerance error $\varepsilon$, and a Monte Carlo size $D$
- Step 1: Controlling Type I error.
- generate $D$ distinct synthetic samples $(\tilde Z_1^{(m, d)})_{d=1}^D$ of size $m$ by refining a pre-trained generative model with $S_1$ under $H_0$.
- use $S_3$ as a validation set for model tuning to avoid overfitting.
- compute the empirical distribution of the test statistic $T$ using $(T(\tilde Z_1^{(m, d)}))_{d=1}^D$. Define a rejection region $C_m$ at a significance level $\alpha > 0$.
- Step 2: Optimizing Synthetic Size through Tuning
- execute Step 1, but use $S_2$ instead of $S_1$ to produce $\tilde Z_2^{(m, d)}$.
- use the empirical distribution from $(T(\tilde Z_2^{(m, d)}))_{d=1}^D$ to find the empirical Type-I error, denoted $\tilde P(C_m)$, for the $C_m$ created in Step 1
- two strategies for identifying $\hat m$
- aggressive: $\hat m = \max\{m:\tilde P(C_m) \le \alpha + \varepsilon\}$
- conservative: $\hat m = \min\{m: \tilde P(C_m)\le \alpha + \varepsilon\quad \text{and} \quad \tilde P(C_{m+1}) > \alpha + \varepsilon\}$
what if $\tilde P(C_m) > \alpha + \varepsilon$ for all $m$?
- Step 3: Calculating the p-value
- with the determined $\hat m$, produce synthetic data $\tilde Z_2^{(\hat m)}$ by fine-tuning the pre-trained generative model using $S_2$ with $S_3$ as a validation set
- calculate the test statistic $T(\tilde Z_2^{(\hat m)})$ and determine the p-value, $P^1$, leveraging the null CDF estimated from $S_1$ in Step 1
- Step 4: Combining the p-values
- repeat Step 3, interchanging the roles of $S_1$ and $S_2$, to compute the p-value $P^2$; combine the two p-values via Hommel’s weighted average
- Hommel’s method excels in controlling the Type-I error relative to many of its peers, ensuring $P(\bar P\le \alpha)\le\alpha$ under $H_0$
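a compact sketch of the four steps, assuming placeholder samplers fine-tuned on $S_1$ and $S_2$ under $H_0$ and a generic statistic $T$. The rejection region is taken as the upper $\alpha$-tail, the tuning rule is the aggressive one, and the combination uses the common unweighted two-p-value Hommel form (the paper’s weighted average may differ):

```python
import numpy as np

def null_region(gen_null, T, m, D, alpha):
    """Step 1: Monte Carlo null distribution of T from D synthetic samples of
    size m; the rejection region is C_m = {T > q_{1-alpha}}."""
    stats = np.array([T(gen_null(m)) for _ in range(D)])
    return np.quantile(stats, 1 - alpha), stats

def tune_m(gen_s1, gen_s2, T, grid, D, alpha, eps):
    """Step 2, aggressive rule: hat m = max{m : tilde P(C_m) <= alpha + eps},
    with C_m built from S1-synthetic data and tilde P from S2-synthetic data.
    Returns None if no m meets the tolerance (cf. the question above)."""
    m_hat = None
    for m in grid:
        q, _ = null_region(gen_s1, T, m, D, alpha)
        type1 = np.mean([T(gen_s2(m)) > q for _ in range(D)])
        if type1 <= alpha + eps:
            m_hat = m
    return m_hat

def p_value(gen_null, gen_infer, T, m_hat, D, alpha):
    """Step 3: p-value of the observed statistic against the estimated null CDF."""
    _, null_stats = null_region(gen_null, T, m_hat, D, alpha)
    return np.mean(null_stats >= T(gen_infer(m_hat)))

def hommel_combine(p1, p2):
    """Step 4: unweighted two-p-value Hommel combination (an assumed form):
    bar P = min(1, 3 * p_(1), 1.5 * p_(2))."""
    lo, hi = min(p1, p2), max(p1, p2)
    return min(1.0, 3.0 * lo, 1.5 * hi)
```

swapping the roles of the $S_1$- and $S_2$-based samplers in the `p_value` call yields $P^2$ for the de-randomization step.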
2.4 Syn-Slm: Streamlined Approach
a streamlined approach for supervised learning tasks.
Syn-Slm directly models the conditional distribution of the outcome given predictors without relying on an additional predictive model for synthetic data
supervised learning aims to predict an outcome based on a set of predictors, denoted as $X$. Consider a scenario with a one-dimensional outcome variable $Y$ and define $Z = (Y, X)$.
The goal is to estimate a statistical functional, $\phi(F_{Y\mid X})$.
the Syn-Slm method introduces a plug-in estimate $\phi(\tilde F_{Y\mid X})$
(TODO: remind me of the GBS by Minsuk)
Syn-Slm directly models the conditional generation of $Y\mid X$ and estimates $\tilde F_{Y\mid X}$ via a Monte Carlo approach, offering a fully nonparametric solution.
Syn-Slm enables the estimation of various statistical functionals $\phi(F_{Y\mid X})$, such as $E(Y\mid X)$, $\mathrm{Var}(Y\mid X)$, and quantiles of $Y\mid X$
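a minimal sketch: once a conditional sampler for $Y\mid X = x$ is available (the Gaussian below is a stand-in for a fine-tuned generative model), any functional is just a statistic of Monte Carlo draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_y_given_x(x, size):
    """Stand-in for a conditional generative model of Y | X = x."""
    return rng.normal(loc=2.0 * x, scale=1.0 + 0.1 * abs(x), size=size)

def plug_in(x, stat, n_draws=10_000):
    """Plug-in estimate phi(tilde F_{Y|X=x}) from Monte Carlo draws."""
    return stat(sample_y_given_x(x, n_draws))

x0 = 1.5
print(plug_in(x0, np.mean))                        # E(Y | X = x0)
print(plug_in(x0, np.var))                         # Var(Y | X = x0)
print(plug_in(x0, lambda y: np.quantile(y, 0.9)))  # 90% conditional quantile
```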
Generative Model and Knowledge Transfer
from the perspective of dimension reduction, dissect knowledge transfer in two scenarios
- consider a generative model $g_\theta$ parametrized by $\theta$
- originally trained on an extensive dataset for a generation task, the model undergoes subsequent fine-tuning on a smaller but similar dataset to account for distribution shift, resulting in a model $g_{\theta'}$. the architecture remains consistent across both models, and the essence of knowledge transfer is the transition from $\theta$ to $\theta'$ during fine-tuning
- a robust pre-trained model undergoes training across multiple tasks, characterized as $(f_1,\ldots, f_t)\circ h$, where $f_i$ is the output function tied to the $i$-th task and $h$ is the shared representation function
- given a learned representation $h$, one fine-tunes only the new output function $f_0$ during its optimization phase
- within this configuration, $f_0\circ h$ and $f_i\circ h$ share the same architecture only in $h$. an alternate strategy entails concurrent fine-tuning of $f_0$ and $h$ to derive a representation tailored explicitly to $f_0$; see the sketch below
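a PyTorch sketch of the two options, with toy modules standing in for the pre-trained representation $h$ and the new head $f_0$:

```python
import torch
import torch.nn as nn

# toy stand-ins: h is the shared pre-trained representation, f0 the new task head
h = nn.Sequential(nn.Linear(64, 32), nn.ReLU())
f0 = nn.Linear(32, 1)

# option 1: freeze h and fine-tune only f0
for p in h.parameters():
    p.requires_grad = False
optimizer = torch.optim.Adam(f0.parameters(), lr=1e-4)

# option 2 (alternate strategy): fine-tune f0 and h jointly to derive a
# representation tailored explicitly to f0
# for p in h.parameters(): p.requires_grad = True
# optimizer = torch.optim.Adam(list(h.parameters()) + list(f0.parameters()), lr=1e-4)

model = nn.Sequential(h, f0)   # the composite f0 ∘ h
```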
Case Studies
Sentiment Analysis
IMDB dataset:
- 50,000 polarized movie reviews
compare Syn’s generative approach against its conventional counterpart in a downstream prediction task, utilizing three models: GPT-3.5, DistilBERT, and LSTM
- GPT-3.5 functions primarily as a text completion model, predicting the succeeding token as the sentiment label
- although essentially a completion model, it is a conditional generative model that aligns with Syn-Slm
- DistilBERT generates a fixed-size embedding of a review, which is then passed to an appended classification head to deduce sentiment likelihood
- it aligns more closely with traditional predictive modeling with transfer learning for supervised tasks
- LSTM, like DistilBERT, processes an embedding and feeds it into a classification head, rendering it a predictive model
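a minimal Hugging Face sketch of the DistilBERT setup (standard public checkpoint names; the fine-tuning loop on IMDB is omitted):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,   # appended classification head for positive/negative sentiment
)

batch = tokenizer(["A moving, beautifully acted film."],
                  truncation=True, padding=True, return_tensors="pt")
logits = model(**batch).logits   # sentiment scores (meaningful after fine-tuning)
```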
Prediction for Structured Data
investigate the generation effect and the challenges associated with enhancing the precision of gradient boosting for regression and classification
Real-Benchmark Examples
utilize a tabular diffusion model, TDM (Kotelnikov et al., 2023), to generate synthetic data of mixed types that closely match the distribution of the original data
TDM employs multinomial and Gaussian diffusion processes to model categorical and continuous attributes, respectively; see the sketch below
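a sketch of the two forward (noising) processes such a model combines, under an illustrative linear schedule: Gaussian diffusion for a continuous column and multinomial diffusion (keep the category or resample uniformly) for a categorical one; the learned reverse denoising networks are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
betas = np.linspace(1e-4, 0.2, T)
alpha_bar = np.cumprod(1.0 - betas)   # illustrative linear noise schedule

def gaussian_forward(x0, t):
    """Continuous column: q(x_t | x_0) = N(sqrt(abar_t) * x0, 1 - abar_t)."""
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.standard_normal(x0.shape)

def multinomial_forward(x0, t, K):
    """Categorical column: q(x_t | x_0) = Cat(abar_t * onehot(x0) + (1 - abar_t)/K)."""
    a = alpha_bar[t]
    probs = a * np.eye(K)[x0] + (1 - a) / K
    return np.array([rng.choice(K, p=p) for p in probs])

print(gaussian_forward(np.array([1.0, -0.5]), t=50))
print(multinomial_forward(np.array([0, 2, 1]), t=50, K=3))
```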