WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Generative Models via Transfer Learning

Posted on
Tags: Diffusion Model, Transfer Learning, Normalizing Flows, Generative Models

this note is for Tian, X., & Shen, X. (2025). Enhancing Accuracy in Generative Models via Knowledge Transfer (No. arXiv:2405.16837). arXiv.

the paper investigates the accuracy of generative models and the impact of knowledge transfer on their generation precision

  • examine a generative model for a target task, fine-tuned using a pre-trained model from a source task
  • building on the “shared embedding” concept, which bridges the source and target tasks, they introduce a novel framework for transfer learning under distribution metrics such as the KL divergence
  • the framework underscores the importance of leveraging inherent similarities between diverse tasks despite their distinct data distributions
  • the theory suggests that the shared structures can augment the generation accuracy for a target task, reliant on the capability of a source model to identify shared structures and effective knowledge transfer from source to target learning
  • to demonstrate the practical utility of this framework, they explore the theoretical implications for two specific generative models: diffusion and normalizing flows
    • the results show enhanced performance in both models over their non-transfer counterparts, indicating advancements for diffusion models and providing fresh insights into normalizing flows in transfer and non-transfer settings

1. Introduction

generative modeling, augmented with transfer learning

  • this process distills knowledge from extensive models pre-trained on large datasets from relevant studies, enabling domain adaptation for specific tasks
  • at its core is the dynamic between the source (pre-trained) and target (fine-tuning) learning tasks, which tend to converge towards shared, concise representations
    • this principle has received less attention in diffusion models and normalizing flows

this paper presents a theoretical framework to assess the accuracy of outputs from generative models

  • accurately evaluating the fidelity of data produced by generative models is increasingly critical for downstream analyses and for maintaining users’ trust in synthetic data
  • although empirical studies show that transfer learning improves diffusion-based generators for both images and tabular data, its theoretical effect on generative accuracy remains underexplored
  • poorly matched source tasks can even induce negative transfer, degrading performance and jeopardizing trustworthy AI goals through misleading scientific conclusions
  • by contrast, transfer learning in supervised settings has been thoroughly analyzed, underscoring the need for principled study in the generative realm
  • complementing diffusion and flow research, Generative Adversarial Networks (GANs) provide a mature toolkit for domain adaptation
    • feature-level alignment via Domain-Adversarial Training
    • unpaired image-to-image translation with CycleGAN
    • multi-domain or data-efficient extensions such as StarGAN
    • these methods demonstrate that adversarial alignment, whether in latent or pixel space, remains an effective paradigm for cross-domain generation

review the relevant literature on the accuracy of two advanced generative models, diffusion models and normalizing flows

  • diffusion models
    • Oko et al. (2023): convergence rates for unconditional generation for smooth densities
    • Chen et al. (2023b): distribution recovery over a low-dimensional linear subspace
    • although conditional diffusion models have shown empirical effectiveness, their theoretical foundations remain underexplored
    • Fu et al. (2024): conditional diffusion models under a smooth density assumption
  • flows
    • the study of generation accuracy for flows remains sparse, with limited exceptions on universal approximation

the paper claims to address the accuracy of target generation

  • generation accuracy is measured by the excess risk, which induces several valuable metrics, such as the Kullback-Leibler (KL) divergence, for assessing distributional closeness

contributions:

  1. generation accuracy theory:
    • introduce the concept of “shared embedding” condition (SEC) to quantify the similarities between the latent representations of source and target learning
    • the SEC distinguishes between conditional and unconditional generation by featuring nonlinear dimension reduction for the former while capturing shared latent representations through embeddings for the latter
  2. diffusion models and normalizing flows via transfer learning
    • examine conditional generation with the KL divergence and the TV-norm for smooth target distributions and unconditional generation with the dimension-scaling Wasserstein distance, specifically in diffusion and coupling flows
    • the analysis reveals that transfer learning strategies, grounded in the shared embedding structures within the lower-dimensional manifold that bridge the source and target learning, can elevate performance over non-transfer methodologies
  3. non-transfer diffusion models and normalizing flows
    • diffusion models structured with the SEC framework achieve a faster KL rate than their non-transfer analogs in the TV-norm for conditional generation with a smooth density
    • in conditional generation, their method exhibits a faster rate under the Wasserstein distance relative to that under the TV-norm
    • crucially, their analysis of coupling flows reveals its competitiveness compared to diffusion models in both conditional and unconditional generation

2. Enhancing generation accuracy and knowledge transfer

  • Assumption 1: independent source and target data

2.1 Conditional generation

  • train a conditional generator for $X_t$ given $Z_t$ using $D_t = \{(x_t^i, z_t^i)\}_{i=1}^{n_t}$
  • independent source training sample $D_s = \{(x_s^i, z_s^i)\}_{i=1}^{n_s}$

SEC for conditional generation:

  • $X_t$ and $X_s$ are allowed to differ in dimensionality
  • sample from $P_{X_t\mid Z_t}$

decompose the auxiliary vectors as

\[Z_t = (Z, Z_{t^c}), \qquad Z_s = (Z, Z_{s^c})\]

where the common block $Z\in \IR^{d_c}$ is shared across tasks

Shared Embedding Condition (SEC)

  • assume there exists a latent representation $h(Z)$ that is common to both tasks such that the conditional laws factor through task-specific decoders $P_t$ and $P_s$:
\[P_{x_t\mid z_t}(\cdot\mid z_t) = P_t(\cdot, h_t(z_t)),\qquad P_{x_s\mid z_s}(\cdot\mid z_s) = P_s(\cdot, h_s(z_s))\]

where $h_j(z_j) = (h(z), z_{j^c})$ and $P_j$ is a suitable probability function

  • SEC presents a dimension reduction framework
  • e.g., a source task of text-prompt-to-image generation ($Z_s$ to $X_s$) and a target task of text-prompt-to-music generation ($Z_t$ to $X_t$) may share common elements $Z$ based on a latent semantic representation $h(Z)$ and task-specific elements $Z_{j^c}$

parameterize

\[P_{x_j\mid z_j}(x, z_j) = P_j(x, h_j(z_j); \theta_j)\]

with $h$ from $\Theta_h$

this approach defines the true distribution

\[P_{x_j\mid z_j}^0(x, z_j) = P_j(x, h_j^0(z_j); \theta_j^0)\]

through true parameters $h_j^0(z_j) = (h^0(z), z_{j^c})$ and

\[(\theta_j^0, h^0) = \argmin_{\theta_j\in \Theta_j, h\in \Theta_h} E_{x_j, z_j}l_j(X_j, Z_j;\theta_j, h)\]

for the source task, minimize the empirical loss on a source training sample

\[(\hat\theta_s, \hat h) = \argmin_{\theta_s\in \Theta_s, h\in \Theta_h} L_s(\theta_s, h)\]

the distribution discrepancy is controlled by the excess risk

\[E_{x_j, z_j}[l_j(X_j, Z_j; \theta_j, h) - l_j(X_j, Z_j; \theta_j^0, h^0)]\]

for example, the negative log-likelihood loss turns a bound on the excess risk into a bound in the KL divergence
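
as a concrete instance (a standard identity written out here for completeness, not quoted from the paper): with the negative log-likelihood loss $l_j(x, z; \theta_j, h) = -\log p_j(x, h_j(z); \theta_j)$, where $p_j$ denotes the density of $P_j$, the excess risk equals an expected KL divergence

\[E_{x_j, z_j}[l_j(X_j, Z_j; \theta_j, h) - l_j(X_j, Z_j; \theta_j^0, h^0)] = E_{z_j}\,\mathrm{KL}\big(P_{x_j\mid z_j}^0(\cdot\mid Z_j)\,\|\, P_j(\cdot, h_j(Z_j); \theta_j)\big)\]

so controlling the excess risk simultaneously controls the expected KL divergence between the true and fitted conditional distributions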

  • Assumption 2: transferability for conditional models
    • characterizes the transitions of the excess risk for the latent structural representation $h$ from source to target tasks
  • Assumption 3: source error
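
a minimal PyTorch sketch of the two-stage transfer procedure described in this subsection; everything below is my own illustration (the module names, the unit-variance Gaussian mean model standing in for $P_j$, and the synthetic data), not the paper's code: $\hat h$ is estimated jointly with the source decoder, then frozen and plugged into target estimation

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_c, d_lat = 4, 8          # shared covariates Z and latent representation h(Z)

class SharedEmbedding(nn.Module):   # plays the role of h in Theta_h
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_c, 32), nn.ReLU(), nn.Linear(32, d_lat))
    def forward(self, z):
        return self.net(z)

class Decoder(nn.Module):           # plays the role of theta_j; here a Gaussian mean model
    def __init__(self, d_extra, d_x):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_lat + d_extra, 32), nn.ReLU(), nn.Linear(32, d_x))
    def forward(self, h_z, z_extra):
        return self.net(torch.cat([h_z, z_extra], dim=-1))

def nll(x, mean):                   # negative log-likelihood up to constants (unit-variance Gaussian)
    return 0.5 * ((x - mean) ** 2).sum(dim=-1).mean()

def fit(params, loss_fn, steps=500, lr=1e-2):
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn()
        loss.backward()
        opt.step()

# synthetic source sample D_s and target sample D_t (purely illustrative)
z_s, zc_s, x_s = torch.randn(2000, d_c), torch.randn(2000, 2), torch.randn(2000, 16)
z_t, zc_t, x_t = torch.randn(200, d_c), torch.randn(200, 3), torch.randn(200, 12)

# stage 1 (source): jointly estimate (theta_s_hat, h_hat) by minimizing L_s(theta_s, h)
h, dec_s = SharedEmbedding(), Decoder(d_extra=2, d_x=16)
fit(list(h.parameters()) + list(dec_s.parameters()), lambda: nll(x_s, dec_s(h(z_s), zc_s)))

# stage 2 (target): transfer h_hat, freeze it, and estimate theta_t_hat by minimizing L_t(theta_t, h_hat)
dec_t = Decoder(d_extra=3, d_x=12)
for p in h.parameters():
    p.requires_grad_(False)
fit(list(dec_t.parameters()), lambda: nll(x_t, dec_t(h(z_t), zc_t)))
```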

2.2 Unconditional generation

to sample from the marginal target distribution $P_{X_t}$, transfer a latent representation learned on the source task

the SEC postulates that the source and target variables, $X_s$ and $X_t$, arise from a shared latent vector $U$ through task-specific decoders

\[X_t = g_t(U), \qquad X_s = g_s(U)\]

consequently

\[P_{x_t}(\cdot) = P_u(g_t^{-1}(\cdot)),\qquad P_{x_s}(\cdot) = P_u(g_s^{-1}(\cdot))\]

e.g., a source task of French text generation $X_s$ from English and a target task of Chinese text generation $X_t$ from English

initially, the numerical embedding of a textual description $U$ in English is transformed into French and Chinese using $g_s$ and $g_t$, respectively

given a latent representation $\{u_s^i\}_{i=1}^{n_s}$ for $\{x_s^i\}_{i=1}^{n_s}$ in the source training sample $D_s$ encoded by an encoder, first estimate the latent distribution $P_u$ using a generative model parameterized by $\Theta_u$ as $\hat P_u = P_u(\cdot; \hat\theta_u)$

with the estimated latent distribution $\hat P_u$, $g_t$ is estimated based on a target training sample
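
a hedged sketch of this two-step pipeline; the Gaussian model for $P_u$, the source encoder, and the untrained placeholder for $\hat g_t$ are my own simplifications (in the paper, $g_t$ is estimated with the diffusion estimator of Section 3.3 or the coupling-flow estimator of Section 4.3)

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_u, d_xs, d_xt = 8, 20, 12

# hypothetical source encoder producing the latents u_s^i (assumed available from source training)
encoder_s = nn.Sequential(nn.Linear(d_xs, 32), nn.ReLU(), nn.Linear(32, d_u))
x_s = torch.randn(2000, d_xs)                     # toy stand-in for the source sample D_s

# step 1: encode the source sample and estimate the latent law P_u;
# a full-covariance Gaussian is used purely for illustration, while the paper
# allows a general generative model parameterized by Theta_u
with torch.no_grad():
    u_s = encoder_s(x_s)
P_u_hat = torch.distributions.MultivariateNormal(
    u_s.mean(0), torch.cov(u_s.T) + 1e-3 * torch.eye(d_u))

# step 2: estimate the target decoder g_t from the target sample, given P_u_hat;
# in the paper this is done with the diffusion (Section 3.3) or coupling-flow (Section 4.3)
# estimators, so the module below is only a placeholder for the fitted map g_t_hat
g_t_hat = nn.Sequential(nn.Linear(d_u, 32), nn.ReLU(), nn.Linear(32, d_xt))

# generation: sample U ~ P_u_hat and push it through g_t_hat to obtain new X_t draws
with torch.no_grad():
    x_new = g_t_hat(P_u_hat.sample((64,)))
print(x_new.shape)                                # torch.Size([64, 12])
```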

  • Assumption 4: Source error for $U$

3. Diffusion models

3.1 Forward and Backward Processes

  • forward process
  • backward process
  • score matching: to estimate the unknown score function, minimize a matching loss between the score and its approximator
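
for concreteness, a minimal denoising score-matching (noise-prediction) objective in the DDPM style; the small network, the linear beta schedule, and the crude time embedding are illustrative choices rather than the paper's specification

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, T = 2, 1000

# variance schedule for the forward (noising) process x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

# a small network predicting the noise (equivalently, a rescaled score) from (x_t, t)
score_net = nn.Sequential(nn.Linear(d_x + 1, 128), nn.ReLU(), nn.Linear(128, d_x))

def dsm_loss(x0):
    """denoising score matching / noise-prediction loss on a batch of clean samples"""
    t = torch.randint(0, T, (x0.shape[0],))
    a = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps            # forward-process sample
    t_feat = (t.float() / T).unsqueeze(-1)                 # crude time embedding
    eps_hat = score_net(torch.cat([x_t, t_feat], dim=-1))
    return ((eps_hat - eps) ** 2).mean()                   # matches the score up to scaling

# one optimization step on toy data
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)
x0 = torch.randn(256, d_x) * 0.5 + 1.0
opt.zero_grad()
dsm_loss(x0).backward()
opt.step()
```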

3.2 Conditional diffusion via transfer learning

to compare transfer and non-transfer conditional diffusion generation, adopt the framework of the transfer model while omitting source learning

  • connection to Fu et al. (2024) for non-transfer conditional generation
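
to make the transfer version concrete, a hedged sketch of a conditional noise predictor whose embedding module $h(z)$ is copied from the source model and frozen while the target-specific layers are fine-tuned; the module names and sizes are mine, and the paper's architecture and estimator may differ

```python
import torch
import torch.nn as nn

class CondScoreNet(nn.Module):
    """noise predictor eps_hat(x_t, t, h(z)) with a shared embedding h and a task-specific head"""
    def __init__(self, d_x, d_z, d_lat=16):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_lat))   # shared embedding
        self.head = nn.Sequential(nn.Linear(d_x + 1 + d_lat, 128), nn.ReLU(), nn.Linear(128, d_x))
    def forward(self, x_t, t_feat, z):
        return self.head(torch.cat([x_t, t_feat, self.h(z)], dim=-1))

# source stage: train the whole network (h and head) on the source task, then reuse h
src_net = CondScoreNet(d_x=16, d_z=6)
# ... source denoising score matching as in the previous sketch ...

# target stage: copy the shared embedding, freeze it, and fine-tune only the target head
tgt_net = CondScoreNet(d_x=12, d_z=6)
tgt_net.h.load_state_dict(src_net.h.state_dict())
for p in tgt_net.h.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam([p for p in tgt_net.parameters() if p.requires_grad], lr=1e-4)
```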

3.3 Unconditional diffusion via transfer learning

  • Assumption 6: Smoothness of $g_t^0$

4. Normalizing flows

4.1 Coupling

normalizing flows transform a random vector $X$ into a base vector $V$ with known density $p_v$, through a diffeomorphic mapping $T(X)$, which is invertible and differentiable

$T$ is the composition of these mappings, $T = \phi_K \circ \cdots \circ \phi_1$, with each $\phi_j$ modeled by a neural network

the density of $X$ is expressed as

\[p_x(x) = p_v(T(x)) \left\lvert \det \frac{\partial T(x)}{\partial x}\right\rvert\]

with the determinant indicating the volume change under $T$

the maximum likelihood approach is used to estimate $T$, enabling the generation of new $X$ samples by inverting $T$ on samples from $p_v$
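
a quick numerical check of the change-of-variables formula for a simple affine diffeomorphism (not from the paper, just a sanity check of the identity above)

```python
import torch

torch.manual_seed(0)

# base density p_v: standard normal in R^2; diffeomorphism T(x) = A x + b with invertible A
A = torch.tensor([[2.0, 0.3], [0.0, 0.7]])
b = torch.tensor([1.0, -1.0])
p_v = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

def log_px(x):
    # log p_x(x) = log p_v(T(x)) + log |det dT/dx|; here dT/dx = A for every x
    v = x @ A.T + b
    return p_v.log_prob(v) + torch.linalg.slogdet(A).logabsdet

# sanity check: if V ~ N(0, I) and X = T^{-1}(V) = A^{-1}(V - b), then X is Gaussian with
# mean -A^{-1} b and covariance A^{-1} A^{-T}, so the two log-densities must agree
A_inv = torch.linalg.inv(A)
exact = torch.distributions.MultivariateNormal(-A_inv @ b, A_inv @ A_inv.T)
x = torch.randn(5, 2)
print(torch.allclose(log_px(x), exact.log_prob(x), atol=1e-4))   # True
```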

Coupling flows

partition $x$ into two parts $x = [x_1, x_2]$

each flow employs a transformation

\[\phi_j(x_1, x_2) = (x_1, q(x_2, w_j(x_1)))\]

the function $q$ modifies $x_2$ based on the output of $w_j$, where $q$ ensures that $\phi_j$ is invertible

Conditional coupling flows

to add a conditioning input $z$ to the coupling layer, adjust

\[\phi_j(x_1, x_2, z) = (x_1, q(x_2, w_j(x_1, z)))\]

4.2 Conditional flows via transfer learning

use three-layer coupling flows with $q(x, y) = x + y$
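
a minimal conditional additive coupling stack with $q(x, y) = x + y$ and three layers; the network widths and the swapping of the two halves between layers are my own choices

```python
import torch
import torch.nn as nn

class CondAdditiveCoupling(nn.Module):
    """phi(x1, x2, z) = (x1, x2 + w(x1, z)); invertible with unit Jacobian determinant"""
    def __init__(self, d1, d2, d_z):
        super().__init__()
        self.w = nn.Sequential(nn.Linear(d1 + d_z, 64), nn.ReLU(), nn.Linear(64, d2))
    def forward(self, x1, x2, z):
        return x1, x2 + self.w(torch.cat([x1, z], dim=-1))
    def inverse(self, y1, y2, z):
        return y1, y2 - self.w(torch.cat([y1, z], dim=-1))

class CondCouplingFlow(nn.Module):
    """three additive coupling layers; the halves are swapped between layers so all coordinates get updated"""
    def __init__(self, d1, d2, d_z, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            [CondAdditiveCoupling(d1 if k % 2 == 0 else d2,
                                  d2 if k % 2 == 0 else d1, d_z) for k in range(n_layers)])
    def forward(self, x1, x2, z):
        for k, layer in enumerate(self.layers):
            if k % 2 == 0:
                x1, x2 = layer(x1, x2, z)
            else:
                x2, x1 = layer(x2, x1, z)
        return x1, x2      # log |det dT/dx| = 0 for additive couplings

flow = CondCouplingFlow(d1=3, d2=3, d_z=2)
x1, x2, z = torch.randn(8, 3), torch.randn(8, 3), torch.randn(8, 2)
y1, y2 = flow(x1, x2, z)
```

since additive couplings are volume preserving, the log-determinant term vanishes and maximum likelihood reduces to maximizing $\log p_v(T(x, z))$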

  • Assumption 7 (Transformation)

4.3 Unconditional flows via transfer learning

  • Assumption 8: Smoothness of $g_t^0$

Comparison of diffusion and flows

  • generation accuracy
  • limit of assumptions
  • network architecture

5. Core Proof Strategy

  • propagating error from source to target under the SEC
  • controlling approximation and estimation errors in diffusion models
  • simultaneously approximating mappings and their derivatives in flow models

6. Numerical experiments

6.1 Simulations

  • conditional generation
  • unconditional generation

6.2 Benchmark example: MNIST-USPS Digit Images

MNIST-USPS is a challenging transfer-learning task because the two handwritten-digit corpora differ substantially in resolution, stroke style, and intra-class variability

  • conditional generation
    • use the MNIST dataset with varying training sample sizes, $n_s \in \{1000, 5000, \ldots, 60000\}$, to train a UNet model from the Diffusers library, augmented with a class-embedding layer for digit-label conditioning
    • to synthesize USPS digits $({0, \ldots, 9})$, fine-tune this MNIST-pre-trained model on $n_t = 5103$ USPS training images, while keeping the class-embedding layer frozen
    • generation quality is evaluated on a held-out test set of 2188 USPS images using the 1-Wasserstein distance between real and generated distributions
  • unconditional generation
    • restrict the task to digit 3 images
    • start from a diffusion model pre-trained on MNIST and fine-tune it on USPS digit-3 samples
    • during pre-training, vary the MNIST source size
    • generation quality is measured by the 1-Wasserstein distance on an independent test split of 198 USPS digit-3 images

in both conditional and unconditional settings, the Wasserstein error of the transfer-diffusion model decreases monotonically as the MNIST pre-training set grows
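
for readers who want to reproduce the conditional setup, a hedged sketch of the fine-tuning loop with the Diffusers `UNet2DModel`; the architecture sizes, the 32x32 resizing, the frozen `class_embedding` attribute (present when `num_class_embeds` is set), and the hyperparameters are my assumptions about the implementation, not code released with the paper

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DModel, DDPMScheduler

# class-conditional UNet; the sizes and 32x32 resizing are illustrative choices, not the post's exact setup
model = UNet2DModel(
    sample_size=32, in_channels=1, out_channels=1,
    block_out_channels=(64, 128, 128),
    down_block_types=("DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D"),
    num_class_embeds=10,            # digit-label conditioning via a class-embedding layer
)
scheduler = DDPMScheduler(num_train_timesteps=1000)

# suppose `model` has already been pre-trained on MNIST; for USPS fine-tuning,
# freeze the class-embedding layer and update the remaining parameters
model.class_embedding.requires_grad_(False)
opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)

def finetune_step(images, labels):
    """one noise-prediction step on a USPS batch: images in [-1, 1] with shape (B, 1, 32, 32)"""
    noise = torch.randn_like(images)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (images.shape[0],))
    noisy = scheduler.add_noise(images, noise, t)
    pred = model(noisy, t, class_labels=labels).sample
    loss = F.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```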

7. Conclusion

  • introduced a shared embedding framework
