Generative Models via Transfer Learning
this note is on Tian, X., & Shen, X. (2025). Enhancing Accuracy in Generative Models via Knowledge Transfer (arXiv:2405.16837).
the paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision
- examine a generative model for a target task, fine-tuned using a pre-trained model from a source task
- building on the “shared embedding” concept, which bridges the source and target tasks, they introduce a novel framework for transfer learning under distribution metrics such as the KL divergence
- the framework underscores the importance of leveraging inherent similarities between diverse tasks despite their distinct data distributions
- the theory suggests that shared structures can improve generation accuracy for a target task, provided the source model can identify these structures and knowledge transfers effectively from source to target learning
- to demonstrate the practical utility of this framework, they explore its theoretical implications for two specific generative models: diffusion models and normalizing flows
- the results show enhanced performance in both models over their non-transfer counterparts, indicating advancements for diffusion models
- and providing fresh insights into normalizing flows in transfer and non-transfer settings
1. Introduction
generative modeling, augmented with transfer learning
- this process distills knowledge from models pre-trained on large datasets from relevant studies, enabling domain adaptation for specific tasks
- at its core is the dynamic between the source (pre-trained) and target (fine-tuning) learning tasks, which tend to converge towards shared, concise representations
- this principle has received less attention in diffusion models and normalizing flows
this paper presents a theoretical framework to assess the accuracy of outputs from generative models
- accurately evaluating the fidelity of data produced by generative models is increasingly critical for downstream analyses and for maintaining users’ trust in synthetic data
- although empirical studies show that transfer learning improves diffusion-based generators for both images and tabular data, its theoretical effect on generative accuracy remains underexplored
- poorly matched source tasks can even induce negative transfer, degrading performance and jeopardizing trustworthy-AI goals through misleading scientific conclusions
- by contrast, transfer learning in supervised settings has been thoroughly analyzed, underscoring the need for principled study in the generative realm
- complementing diffusion and flow research, Generative Adversarial Networks (GANs) provide a mature toolkit for domain adaptation
- feature-level alignment via Domain-Adversarial Training
- unpaired image-to-image translation with CycleGAN
- multi-domain or data-efficient extensions such as StarGAN
- these methods demonstrate that adversarial alignment, whether in latent or pixel space, remains an effective paradigm for cross-domain generation
review the relevant literature on the accuracy of two advanced generative models, diffusion models and normalizing flows
- diffusion models
- Oko et al. (2023): convergence rates for unconditional generation for smooth densities
- Chen et al. (2023b): distribution recovery over a low-dimensional linear subspace
- although conditional diffusion models have shown effectiveness, their theoretical foundation remains underexplored
- Fu et al. (2024): conditional diffusion models under a smooth density assumption
- flows
- the study of generation accuracy for flows remains sparse, with limited exceptions on universal approximation
the paper claims to address the accuracy of target generation
- generation accuracy is measured by the excess risk, which induces several useful metrics, such as the Kullback-Leibler (KL) divergence, for assessing distributional closeness
contributions:
- generation accuracy theory:
- introduce the concept of “shared embedding” condition (SEC) to quantify the similarities between the latent representations of source and target learning
- the SEC distinguishes between conditional and unconditional generation by featuring nonlinear dimension reduction for the former while capturing shared latent representations through embeddings for the latter
- diffusion models and normalizing flows via transfer learning
- examine conditional generation under the KL divergence and the TV-norm for smooth target distributions, and unconditional generation under the dimension-scaling Wasserstein distance, specifically for diffusion models and coupling flows (the three metrics are recalled after this list)
- the analysis reveals that transfer learning strategies, grounded in the shared embedding structures on a lower-dimensional manifold that bridge source and target learning, can outperform non-transfer methodologies
- non-transfer diffusion models and normalizing flows
- for conditional generation with a smooth density, diffusion models structured with the SEC framework achieve a faster rate in the KL divergence than their non-transfer analogs attain in the TV-norm
- in unconditional generation, their method exhibits a faster rate under the Wasserstein distance relative to that under the TV-norm
- crucially, their analysis of coupling flows reveals that they are competitive with diffusion models in both conditional and unconditional generation
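for reference, the three metrics used above have standard definitions (textbook forms, not notation taken from the paper): for distributions $P$ and $Q$,
\[\mathrm{KL}(P\,\|\,Q) = \int \log\frac{dP}{dQ}\,dP, \qquad \mathrm{TV}(P, Q) = \sup_{A}\,\lvert P(A) - Q(A)\rvert, \qquad W_1(P, Q) = \inf_{\gamma\in\Gamma(P,Q)} E_{(X,Y)\sim\gamma}\,\lVert X - Y\rVert,\]where $\Gamma(P, Q)$ denotes the set of couplings of $P$ and $Q$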
2. Enhancing generation accuracy and knowledge transfer
- Assumption 1: independent source and target data
2.1 Conditional generation
- train a conditional generator for $X_t$ given $Z_t$ using $D_t = \{(x_t^i, z_t^i)\}_{i=1}^{n_t}$
- independent source training sample $D_s = \{(x_s^i, z_s^i)\}_{i=1}^{n_s}$
SEC for conditional generation:
- $X_t$ and $X_s$ are allowed to differ in dimensionality
- sample from $P_{X_t\mid Z_t}$
decompose the auxiliary vectors as
\[Z_t = (Z, Z_{t^c}), \qquad Z_s = (Z, Z_{s^c})\]where the common block $Z\in \mathbb{R}^{d_c}$ is shared across tasks
Shared Embedding Condition (SEC)
- assume there exists a latent representation $h(Z)$ that is common to both tasks such that the conditional laws factor through task-specific decoders $P_t$ and $P_s$:
\[P_{x_j\mid z_j}(x, z_j) = P_j(x, h_j(z_j)), \qquad j \in \{s, t\},\]where $h_j(z_j) = (h(z), z_{j^c})$ and $P_j$ is a suitable probability function
- SEC presents a dimension reduction framework
- e.g., a source task of text-prompt-to-image generation ($Z_s$ to $X_s$) and a target task of text-prompt-to-music generation ($Z_t$ to $X_t$) may share common elements $Z$ through a latent semantic representation $h(Z)$, alongside task-specific elements $Z_{j^c}$
parameterize
\[P_{x_j\mid z_j}(x, z_j) = P_j(x, h_j(z_j); \theta_j)\]with $\theta_j \in \Theta_j$ and $h \in \Theta_h$
this approach defines the true distribution
\[P_{x_j\mid z_j}^0(x, z_j) = P_j(x, h_j^0(z_j); \theta_j^0)\]through true parameters $h_j^0(z_j) = (h^0(z), z_{j^c})$ and
\[(\theta_j^0, h^0) = \argmin_{\theta_j\in \Theta_j, h\in \Theta_h} E_{x_j, z_j}[l_j(X_j, Z_j;\theta_j, h)]\]for the source task, minimize the empirical loss on a source training sample
\[(\hat\theta_s, \hat h) = \argmin_{\theta_s\in \Theta_s, h\in \Theta_h} L_s(\theta_s, h)\]the distribution discrepancy is controlled by the excess risk
\[E_{x_j, z_j}[l_j(X_j, Z_j; \theta_j, h) - l_j(X_j, Z_j; \theta_j^0, h^0)]\]for example, with the negative log-likelihood loss, a bound on the excess risk implies a corresponding error bound in the KL divergence
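a minimal sketch of this two-stage estimation under the SEC, assuming a negative log-likelihood loss with a Gaussian decoder; the module names (`SharedEmbedding`, `ConditionalHead`), architectures, and synthetic data are illustrative choices, and the task-specific blocks $Z_{j^c}$ are ignored for brevity, so this is not the authors' implementation

```python
import torch
import torch.nn as nn

# Stage 1: jointly estimate (theta_s, h) on the source sample; Stage 2: freeze h
# and estimate theta_t on the target sample. Names and sizes are illustrative.

class SharedEmbedding(nn.Module):            # h(.): shared latent representation of Z
    def __init__(self, d_z, d_h):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_h))

    def forward(self, z):
        return self.net(z)

class ConditionalHead(nn.Module):            # theta_j: Gaussian decoder P_j(x | h_j(z))
    def __init__(self, d_h, d_x):
        super().__init__()
        self.mean = nn.Linear(d_h, d_x)
        self.log_std = nn.Parameter(torch.zeros(d_x))

    def nll(self, x, h):                      # negative log-likelihood loss l_j
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return -dist.log_prob(x).sum(-1).mean()

def fit(head, h, batches, train_h, epochs=50):
    params = list(head.parameters()) + (list(h.parameters()) if train_h else [])
    opt = torch.optim.Adam(params, lr=1e-3)
    for _ in range(epochs):
        for x, z in batches:
            loss = head.nll(x, h(z))
            opt.zero_grad(); loss.backward(); opt.step()
    return head

# toy usage with synthetic data
d_z, d_h, d_x = 3, 2, 2
h_hat = SharedEmbedding(d_z, d_h)
source = [(torch.randn(32, d_x), torch.randn(32, d_z)) for _ in range(20)]
target = [(torch.randn(32, d_x), torch.randn(32, d_z)) for _ in range(5)]
fit(ConditionalHead(d_h, d_x), h_hat, source, train_h=True)   # source: learn h jointly
for p in h_hat.parameters():
    p.requires_grad_(False)                                    # transfer: freeze h_hat
theta_t_hat = fit(ConditionalHead(d_h, d_x), h_hat, target, train_h=False)
```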
- Assumption 2: transferability for conditional models
- characterizes how the excess risk associated with the latent structural representation $h$ transfers from the source task to the target task
- Assumption 3: source error
2.2 Unconditional generation
to sample from the marginal target distribution $P_{X_t}$, transfer a latent representation learned on the source task
the SEC postulates that the source and target variables, $X_s$ and $X_t$, arise from a shared latent vector $U$ through task-specific decoders
\[X_t = g_t(U), \qquad X_s = g_s(U)\]consequently
\[P_{x_t}(\cdot) = P_u(g_t^{-1}(\cdot)),\qquad P_{x_s}(\cdot) = P_u(g_s^{-1}(\cdot))\]e.g., a source task of French text generation $X_s$ from English and target task of Chinese text generation $X_t$ from English
initially, the numerical embedding of a textual description $U$ in English is transformed into French and Chinese using $g_s$ and $g_t$, respectively
given a latent representation $\{u_s^i\}_{i=1}^{n_s}$ for $\{x_s^i\}_{i=1}^{n_s}$ in the source training sample $D_s$ encoded by an encoder, first estimate the latent distribution $P_u$ using a generative model parameterized by $\Theta_u$ as $\hat P_u = P_u(\cdot; \hat\theta_u)$
with the estimated latent distribution $\hat P_u$, $g_t$ is estimated based on a target training sample
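a toy sketch of the latent step, assuming a pre-trained encoder has already produced source latents $u_s^i$ and using a Gaussian mixture as a stand-in for the generative model indexed by $\Theta_u$; fitting the decoder $g_t$ on the target sample is a separate, model-specific step

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Estimate the latent law P_u from encoded source latents, then draw latent samples
# on which a target decoder g_t would be trained. The Gaussian mixture and the
# random placeholder latents are stand-ins, not the paper's choices.

rng = np.random.default_rng(0)
u_source = rng.normal(size=(5000, 16))            # placeholder for encoder outputs u_s^i

p_u_hat = GaussianMixture(n_components=10, covariance_type="full", random_state=0)
p_u_hat.fit(u_source)                             # \hat P_u = P_u(.; \hat\theta_u)

u_draws, _ = p_u_hat.sample(1000)                 # latent draws U ~ \hat P_u
# a target decoder g_t (e.g., a network trained on D_t) then maps u_draws to
# synthetic target samples X_t = g_t(U)
```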
- Assumption 4: Source error for $U$
3. Diffusion models
3.1 Forward and Backward Processes
- forward process
- backward process
- score matching: to estimate the unknown score function, minimize a matching loss between the score and its approximator
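a minimal denoising score-matching objective, a standard surrogate for the matching loss above; the toy score network, time embedding, and noise schedule are illustrative assumptions, not the paper's construction

```python
import torch
import torch.nn as nn

# Denoising score matching: perturb x at a random time step and regress the network
# onto the score of the Gaussian perturbation kernel, which equals -noise / std.

def dsm_loss(score_net, x, alphas_bar):
    b = x.shape[0]
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x.device)
    a_bar = alphas_bar[t].view(b, *([1] * (x.dim() - 1)))
    noise = torch.randn_like(x)
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * noise          # forward perturbation
    target_score = -noise / (1 - a_bar).sqrt()                   # score of N(sqrt(a_bar) x, (1 - a_bar) I)
    return ((score_net(x_t, t) - target_score) ** 2 * (1 - a_bar)).mean()

# toy score network on vectors in R^d with a crude scalar time input
d = 8
net = nn.Sequential(nn.Linear(d + 1, 128), nn.SiLU(), nn.Linear(128, d))
score_net = lambda x_t, t: net(torch.cat([x_t, t.float().unsqueeze(-1) / 1000], dim=-1))
alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
loss = dsm_loss(score_net, torch.randn(32, d), alphas_bar)
```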
3.2 Conditional diffusion via transfer learning
to compare transfer conditional and non-transfer diffusion generations, adopt the framework of the transfer model while omitting source learning
- connection to Fu et al. (2024) for non-transfer conditional generation
3.3 Unconditional diffusion via transfer learning
- Assumption 6: Smoothness of $g_t^0$
4. Normalizing flows
4.1 Coupling
normalizing flows transform a random vector $X$ into a base vector $V$ with known density $p_v$, through a diffeomorphic mapping $T(X)$, which is invertible and differentiable
$T$ is the composition of these mappings, $T = \phi_K \circ \cdots \circ \phi_1$, with each $\phi_j$ modeled by a neural network
the density of $X$ is expressed as
\[p_x(x) = p_v(T(x)) \left\lvert \det \frac{\partial T(x)}{\partial x}\right\rvert\]with the determinant indicating the volume change under $T$
the maximum likelihood approach is used to estimate $T$, enabling the generation of new $X$ samples by inverting $T$ on samples from $p_v$
Coupling flows
partition $x$ into two parts $x = [x_1, x_2]$
each flow employs a transformation
\[\phi_j(x_1, x_2) = (x_1, q(x_2, w_j(x_1)))\]the function $q$ modifies $x_2$ based on the output of the conditioner $w_j$, and $q$ is chosen so that $\phi_j$ is invertible
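a minimal coupling layer of the form above, using an elementwise affine $q$ so that both the inverse and the Jacobian log-determinant are available in closed form; the conditioner architecture is an illustrative choice

```python
import torch
import torch.nn as nn

# One coupling layer phi_j(x1, x2) = (x1, q(x2, w_j(x1))) with an affine q; the
# Jacobian is triangular, so its log-determinant is the sum of the log scales.

class AffineCoupling(nn.Module):
    def __init__(self, d1, d2, hidden=64):
        super().__init__()
        self.w = nn.Sequential(nn.Linear(d1, hidden), nn.ReLU(),
                               nn.Linear(hidden, 2 * d2))        # conditioner w_j(x1)

    def forward(self, x1, x2):
        s, b = self.w(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(s) + b                               # q(x2, w_j(x1))
        return x1, y2, s.sum(-1)                                 # log|det d(phi_j)/dx|

    def inverse(self, x1, y2):
        s, b = self.w(x1).chunk(2, dim=-1)
        return x1, (y2 - b) * torch.exp(-s)

# maximum likelihood uses log p_x(x) = log p_v(T(x)) + sum_j logdet_j, e.g. with a
# standard normal base density p_v; sampling inverts the layers in reverse order
```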
Conditional coupling flows
to add a conditioning input $z$ to the coupling layer, adjust
\[\phi_j(x_1, x_2, z) = (x_1, q(x_2, w(x_1, z)))\]
4.2 Conditional flows via transfer learning
use three-layer coupling flows with $q(x, y) = x + y$ (a sketch follows below)
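a sketch of this additive, conditional case, assuming the two blocks have equal dimension and are swapped between the three layers so every coordinate gets updated (an illustrative choice, not stated in the paper); with $q(x, y) = x + y$ the Jacobian log-determinant is zero, so the exact log-likelihood only involves the base density

```python
import math
import torch
import torch.nn as nn

# Three additive conditional coupling layers phi_j(x1, x2, z) = (x1, x2 + w_j(x1, z)).

class AdditiveConditionalFlow(nn.Module):
    def __init__(self, d_half, d_z, hidden=64, n_layers=3):
        super().__init__()
        self.w = nn.ModuleList([
            nn.Sequential(nn.Linear(d_half + d_z, hidden), nn.ReLU(),
                          nn.Linear(hidden, d_half))
            for _ in range(n_layers)
        ])

    def forward(self, x1, x2, z):                 # T(x; z): data -> base
        for w_j in self.w:
            x2 = x2 + w_j(torch.cat([x1, z], dim=-1))
            x1, x2 = x2, x1                       # swap blocks between layers
        return x1, x2

    def log_likelihood(self, x1, x2, z):          # exact, since log|det| = 0
        v = torch.cat(self.forward(x1, x2, z), dim=-1)
        return (-0.5 * v ** 2 - 0.5 * math.log(2 * math.pi)).sum(-1)
```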
- Assumption 7 (Transformation)
4.3 Unconditional flows via transfer learning
- Assumption 8: Smoothness of $g_t^0$
Comparison of diffusion and flows
- generation accuracy
- limit of assumptions
- network architecture
5. Core Proof Strategy
- propagating error from source to target under the SEC
- controlling approximation and estimation errors in diffusion models
- simultaneously approximating mappings and their derivatives in flow models
6. Numerical experiments
6.1 Simulations
- conditional generation
- unconditional generation
6.2 Benchmark example: MNIST-USPS Digit Images
MNIST-USPS is a challenging transfer-learning task because the two handwritten-digit corpora differ substantially in resolution, stroke style, and intra-class variability
- conditional generation
- use the MNIST dataset with varying training sample sizes, $n_s \in \{1000, 5000, \ldots, 60{,}000\}$, to train a UNet model from the Diffusers library, augmented with a class-embedding layer for digit-label conditioning
- to synthesize USPS digits ($\{0, \ldots, 9\}$), fine-tune this MNIST-pre-trained model on $n_t = 5103$ USPS training images, while keeping the class-embedding layer frozen
- generation quality is evaluated on a held-out test set of 2188 USPS images using the 1-Wasserstein distance between real and generated distributions
- unconditional generation
- restrict the task to digit 3 images
- start from a diffusion model pre-trained on MNIST and fine-tune it on USPS digit-3 samples
- during pre-training, vary the MNIST source size
- generation quality is measured by the 1-Wasserstein distance on an independent test split of 198 USPS digit-3 images
in both conditional and unconditional settings, the Wasserstein error of the transfer-diffusion model decreases monotonically as the MNIST pre-training set grows
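a condensed sketch of this experiment under stated assumptions: images resized to 32x32, default UNet widths, arbitrary hyperparameters, and a sliced approximation of the 1-Wasserstein distance on flattened images (the note does not say how the distance is computed); `class_embedding` is the attribute under which Diffusers' `UNet2DModel` keeps the label embedding when `num_class_embeds` is set

```python
import numpy as np
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler
from scipy.stats import wasserstein_distance

# class-conditional UNet; in practice it would be loaded from the MNIST pre-training run
model = UNet2DModel(sample_size=32, in_channels=1, out_channels=1, num_class_embeds=10)
# ... e.g. model.load_state_dict(torch.load("mnist_pretrained.pt"))

for p in model.class_embedding.parameters():      # freeze the digit-label embedding
    p.requires_grad_(False)

scheduler = DDPMScheduler(num_train_timesteps=1000)
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

def finetune_step(images, labels):
    """One DDPM training step on a USPS batch of shape (B, 1, 32, 32)."""
    noise = torch.randn_like(images)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noisy = scheduler.add_noise(images, noise, t)
    pred = model(noisy, t, class_labels=labels).sample    # predict the added noise
    loss = F.mse_loss(pred, noise)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def sliced_w1(real, fake, n_proj=256, seed=0):
    """Sliced 1-Wasserstein between two arrays of flattened images (rows = samples)."""
    rng = np.random.default_rng(seed)
    d = real.shape[1]
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)                             # random unit direction
        total += wasserstein_distance(real @ v, fake @ v)  # exact 1-D W1 per slice
    return total / n_proj
```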
7. Conclusion
- introduced a shared embedding framework