
Review on Normalizing Flows

Tags: Generative Models, Normalizing Flows, Density Estimation, Variational Inference, Invertible Neural Networks

This note is for Kobyzev, I., Prince, S. J. D., & Brubaker, M. A. (2020). Normalizing Flows: An Introduction and Review of Current Methods (No. arXiv:1908.09257). arXiv. http://arxiv.org/abs/1908.09257

Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact.

a major goal: model a probability distribution given samples drawn from that distribution

this is

  • an example of unsupervised learning
  • sometimes called generative modelling

applications:

  • density estimation
  • outlier detection
  • prior construction
  • dataset summarization

many methods for generative modelling:

  • direct analytic approaches approximate observed data with a fixed family of distributions
  • variational approaches and expectation maximization introduce latent variables to explain the observed data
    • additional flexibility but can increase the complexity of learning and inference
  • graphical models explicitly model the conditional dependence between random variables
  • generative neural approaches: generative adversarial networks (GANs) and variational auto-encoders (VAEs)
    • neither allows for exact evaluation of the probability density of new points
  • training can be challenging due to a variety of phenomena including mode collapse, posterior collapse, vanishing gradients and training instability

normalizing flows (NF) are a family of generative models with tractable distributions where both sampling and density evaluation can be efficient and exact.

Background

A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings.

The density of a sample can be evaluated by transforming it back to the original simple distribution and then computing the product of

  • the density of the inverse-transformed sample under this distribution
  • the associated change in volume induced by the sequence of inverse transformations

the result of this approach is a mechanism to construct new families of distributions by choosing an initial density and then chaining together some number of parameterized, invertible, and differentiable transformations

Basics

  • $Z \in \IR^D$: r.v. with a known and tractable pdf $p_Z: \IR^D \rightarrow \IR$
  • $g$: an invertible function with inverse $f = g^{-1}$, and $Y = g(Z)$
\[p_Y(y) = p_Z(f(y))\vert \det Df(y)\vert = p_Z(f(y))\vert\det Dg(f(y))\vert^{-1}\]

the new density is called a pushforward of the density $p_Z$ by the function $g$ and denoted by $g\star p_Z$

  • generative direction: from base density to final complicated density, $y = g(z)$
  • normalizing direction: the inverse function $f$ moves (or "flows") in the opposite direction, from the complicated distribution back toward the simple base density

the term "normalizing" is doubly accurate if the base measure $p_Z$ is chosen to be a Normal distribution, as it often is in practice
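As a quick numerical check of the change-of-variables formula above, here is a minimal NumPy sketch (my own toy example, not from the paper) that pushes a standard normal through the affine bijection $g(z) = az + b$ and verifies that the pushforward density matches the known $N(b, a^2)$.

```python
import numpy as np
from scipy.stats import norm

# Base density p_Z: standard normal; bijection g(z) = a z + b with inverse f(y) = (y - b) / a.
a, b = 2.0, 1.0
f = lambda y: (y - b) / a

def p_Y(y):
    # Change of variables: p_Y(y) = p_Z(f(y)) * |det Df(y)|, and here |Df(y)| = 1 / |a|.
    return norm.pdf(f(y)) * abs(1.0 / a)

# Sanity check against the known pushforward N(b, a^2).
ys = np.linspace(-5, 7, 200)
assert np.allclose(p_Y(ys), norm.pdf(ys, loc=b, scale=abs(a)))
```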

normalizing flows: the term here means bijections which are convenient to compute, invert, and whose Jacobian determinant is easy to calculate.

the composition of invertible functions is itself invertible and the determinant of its Jacobian has a specific form.

Let $g_1,\ldots, g_N$ be a set of $N$ bijective functions and define

\[g = g_N\circ g_{N-1}\circ \cdots \circ g_1\]

to be the composition of the functions

then $g$ is also bijective, with inverse

\[f = f_1\circ \cdots \circ f_{N-1}\circ f_N\]

and the determinant of the Jacobian is

\[\det Df(y) = \prod_{i=1}^N \det Df_i(x_i)\]

denote the value of the $i$-th intermediate flow as

\[x_i = g_i\circ \cdots \circ g_1(z) = f_{i+1} \circ \cdots \circ f_N(y),\]

and so $x_N = y$
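A small sketch (my own toy choice of two bijections) of how the composed inverse $f = f_1\circ f_2$ accumulates the log-determinant: the per-flow terms $\log\vert\det Df_i(x_i)\vert$ simply add.

```python
import numpy as np

# Generative direction: g = g2 ∘ g1 with g1(z) = 3z + 1 and g2(x) = exp(x).
# Normalizing direction: f = f1 ∘ f2 with f2(y) = log(y) and f1(x) = (x - 1) / 3.
f2 = lambda y: np.log(y)
logdet_f2 = lambda y: -np.log(np.abs(y)).sum()      # Df2(y) is diagonal with entries 1/y
f1 = lambda x: (x - 1.0) / 3.0
logdet_f1 = lambda x: -np.log(3.0) * x.size         # Df1 is (1/3) * identity

log_p_Z = lambda z: -0.5 * (z ** 2).sum() - 0.5 * z.size * np.log(2 * np.pi)  # standard normal base

def log_p_Y(y):
    # log p_Y(y) = log p_Z(f(y)) + log|det Df2(x_2)| + log|det Df1(x_1)|, with x_2 = y, x_1 = f2(y)
    x1 = f2(y)
    z = f1(x1)
    return log_p_Z(z) + logdet_f2(y) + logdet_f1(x1)

print(log_p_Y(np.array([2.0, 3.0])))
```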

Applications

Density estimation and sampling

assume that

  • only a single flow, $g$, is used and is parameterized by the vector $\theta$
  • the base measure $p_Z$ is given and is parameterized by the vector $\phi$

given a set of data observed from some complicated distribution, $\cD = \{y^{(i)}\}_{i=1}^M$, we can then perform likelihood-based estimation of the parameters $\Theta = (\theta, \phi)$; the log-likelihood of the data becomes

\[\log p(\cD\mid \Theta) = \sum_{i=1}^M \log p_Y(y^{(i)}\mid \Theta) = \sum_{i=1}^M \left[\log p_Z(f(y^{(i)}\mid \theta)\mid \phi) + \log \vert \det Df(y^{(i)}\mid \theta)\vert\right]\]
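To make the objective concrete, here is a hedged sketch of maximizing this log-likelihood for a toy one-parameter flow $g(z\mid\theta) = \theta z$ with a standard normal base (the finite-difference gradient is for illustration only; in practice one would use automatic differentiation).

```python
import numpy as np

def log_likelihood(theta, data):
    # Flow g(z | theta) = theta * z with inverse f(y | theta) = y / theta; base p_Z = N(0, 1).
    z = data / theta
    log_p_z = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    log_det = -np.log(np.abs(theta))                 # |det Df(y | theta)| = 1 / |theta|
    return np.sum(log_p_z + log_det)

rng = np.random.default_rng(0)
data = 2.5 * rng.standard_normal(1000)               # samples from the "complicated" target

# Maximize the log-likelihood by gradient ascent (finite-difference gradient for illustration).
theta, lr, eps = 1.0, 0.1, 1e-5
for _ in range(200):
    grad = (log_likelihood(theta + eps, data) - log_likelihood(theta - eps, data)) / (2 * eps)
    theta += lr * grad / len(data)
print(theta)                                         # approaches the true scale 2.5
```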

while MLE is often effective, other forms of estimation can be, and have been, used with normalizing flows. In particular, adversarial losses can be used with normalizing flow models (e.g., Flow-GAN)

Variational Inference

to approximate an intractable posterior over latent variables $y$ given observations $x$, use variational inference and introduce the approximate posterior $q(y\mid x, \theta)$

this is done by minimizing the KL divergence $D_{KL}(q(y\mid x, \theta)\Vert p(y\mid x))$

reparameterize $q(y\mid x, \theta) = p_Y(y\mid \theta)$ with normalizing flows.

assume that only a single flow $g$ with parameters $\theta$ is used, $y = g(z\mid \theta)$ and the base distribution $p_Z(z)$ does not depend on $\theta$, then

\[\bbE_{p_Y(y\mid \theta)}[h(y)] = \bbE_{p_Z(z)}[h(g(z\mid\theta))]\]

reparametrization trick

in this scenario, evaluating the likelihood is only required at points which have been sampled
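A minimal sketch of the identity above, using a toy flow $g(z\mid\theta) = z + \theta$ of my own: expectations under $p_Y(\cdot\mid\theta)$ are estimated with samples drawn only from the base $p_Z$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
g = lambda z: z + theta                  # toy flow y = g(z | theta); base p_Z = N(0, 1)
h = lambda y: y ** 2                     # any test function h(y)

# E_{p_Y(y | theta)}[h(y)] = E_{p_Z(z)}[h(g(z | theta))]: only base samples are needed,
# and the dependence on theta stays inside g, which is what makes gradients tractable.
z = rng.standard_normal(100_000)
estimate = h(g(z)).mean()
print(estimate)                          # ≈ E[(z + 1.5)^2] = 1 + 1.5^2 = 3.25
```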

3. Methods


3.1 Elementwise Flows

a basic form of bijective non-linearity can be constructed given any bijective scalar function.

that is, let $h:\IR\rightarrow\IR$ be a scalar valued bijection, then if $x = (x_1,\ldots, x_D)^T$,

\[g(x) = (h(x_1), h(x_2), \ldots, h(x_D))^T\]

is also a bijection whose inverse simply requires computing $h^{-1}$ elementwise, and the absolute value of its Jacobian determinant is the product of the absolute values of the derivatives of $h$

in deep learning terminology, $h$ could be viewed as an “activation function”

note that the most commonly used activation function, ReLU, is not bijective and so cannot be used directly, but the (Parametric) Leaky ReLU can be used instead
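A sketch of such an elementwise flow, assuming a Leaky ReLU with negative slope $\alpha$ (my own minimal version): the log-determinant is just the sum of log-derivatives over dimensions.

```python
import numpy as np

alpha = 0.1   # negative slope of the Leaky ReLU; must be nonzero so that h is bijective

def leaky_relu_flow(x):
    # Apply h elementwise and accumulate log|det Dg(x)| = sum_d log|h'(x_d)|.
    y = np.where(x >= 0, x, alpha * x)
    log_det = np.where(x >= 0, 0.0, np.log(alpha)).sum()
    return y, log_det

def leaky_relu_flow_inverse(y):
    # h is monotone, so sign(x) = sign(y) and the inverse is again elementwise.
    return np.where(y >= 0, y, y / alpha)

x = np.array([-2.0, 0.5, 3.0])
y, log_det = leaky_relu_flow(x)
assert np.allclose(leaky_relu_flow_inverse(y), x)
```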

3.2 Linear Flows

Elementwise operations alone are insufficient as they cannot express any form of correlation between dimensions

\[g(x) = Ax + b,\]

where $A \in \IR^{D\times D}$ is an invertible matrix and $b \in \IR^D$; the determinant of the Jacobian is simply $\det A$

3.2.1 Diagonal

a diagonal $A$ reduces the linear flow to an elementwise transformation, with determinant equal to the product of the diagonal entries

3.2.2 Triangular

the triangular matrix is a more expressive form of linear transformation whose determinant is the product of its diagonal entries

3.2.3 Permutation and Orthogonal

orthogonal matrices parameterized by the Householder transformation

fact: any orthogonal matrix can be written as a product of reflections
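A sketch of the Householder construction (my own toy version): each reflection is orthogonal, so a product of them is orthogonal, and since $\vert\det Q\vert = 1$ the volume term of the flow is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4

def householder(v):
    # Reflection across the hyperplane orthogonal to v: H = I - 2 v v^T / (v^T v).
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

# A product of reflections is orthogonal; |det Q| = 1, so this linear flow
# changes no volume and contributes nothing to the log-determinant.
Q = np.eye(D)
for _ in range(D):
    Q = Q @ householder(rng.standard_normal(D))

assert np.allclose(Q @ Q.T, np.eye(D))
```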

3.2.4 Factorizations

\[g(x) = PLUx + b\]
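A sketch of the PLU idea (my own illustration, using SciPy's LU factorization): the log-determinant reduces to the diagonal of $U$, and inversion amounts to a permutation plus two triangular solves.

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
b = rng.standard_normal(4)
P, L, U = lu(A)          # A = P L U, with L unit lower triangular and U upper triangular

def linear_flow(x):
    # g(x) = A x + b; log|det Dg| = sum of log|diag(U)| (constant in x).
    return A @ x + b, np.log(np.abs(np.diag(U))).sum()

def linear_flow_inverse(y):
    # Undo the permutation, then two triangular solves.
    w = solve_triangular(L, P.T @ (y - b), lower=True)
    return solve_triangular(U, w, lower=False)

x = rng.standard_normal(4)
y, log_det = linear_flow(x)
assert np.allclose(linear_flow_inverse(y), x)
```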

3.2.5 Convolution

Hoogeboom et al. (2019): a more general solution for modelling $d\times d$ convolutions,

  • either by stacking together masked autoregressive convolutions (referred to as Emerging Convolutions)
  • or by exploiting the Fourier domain representation of convolution (referred to as Periodic Convolutions)

3.3 Planar and Radial Flows

3.3.1 Planar Flows

expand and contract the distribution along certain specific directions and take the form

\[g(x) = x + uh(w^Tx + b)\]
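A sketch of a planar flow with $h = \tanh$ (parameter values are my own): by the matrix determinant lemma, $\det(I + h'(w^Tx+b)\,uw^T) = 1 + h'(w^Tx+b)\,w^Tu$, so the log-determinant costs $O(D)$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
u, w, b = rng.standard_normal(D), rng.standard_normal(D), 0.5
h, h_prime = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2

def planar_flow(x):
    a = w @ x + b
    y = x + u * h(a)
    # Matrix determinant lemma: det(I + h'(a) u w^T) = 1 + h'(a) * (w^T u).
    # Invertibility of the planar flow additionally requires a constraint on u and w
    # (w^T u >= -1 for h = tanh).
    log_det = np.log(np.abs(1.0 + h_prime(a) * (w @ u)))
    return y, log_det

y, log_det = planar_flow(rng.standard_normal(D))
print(y, log_det)
```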

3.3.2 Radial Flows

modify the distribution around a specific point so that

\[g(\bfx) = \bfx + \frac{\beta}{\alpha + \Vert \bfx - \bfx_0\Vert} (\bfx - \bfx_0)\]

3.4 Coupling and Autoregressive Flows

3.4.1 Coupling Flows

consider a disjoint partition of the input $x\in \IR^D$ into two subspaces: $(x^A, x^B) \in \IR^d \times \IR^{D-d}$ and a bijection $h(\cdot;\theta): \IR^d\rightarrow\IR^d$

\[y^A = h(x^A; \Theta(x^B))\\ y^B = x^B\]


Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using Real NVP (No. arXiv:1605.08803). arXiv. https://doi.org/10.48550/arXiv.1605.08803
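A sketch of an affine coupling layer in the Real NVP style; the linear "conditioner" producing the scales and translations below is a placeholder of my own (a real model uses a neural network).

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 2, 4
# Placeholder conditioner Theta(x^B) producing log-scales s and translations t from x^B.
Ws, Wt = rng.standard_normal((d, D - d)), rng.standard_normal((d, D - d))
s = lambda xb: np.tanh(Ws @ xb)          # log-scales, bounded for numerical stability
t = lambda xb: Wt @ xb                   # translations

def coupling_forward(x):
    xa, xb = x[:d], x[d:]
    ya = xa * np.exp(s(xb)) + t(xb)      # y^A = h(x^A; Theta(x^B)) with elementwise affine h
    log_det = s(xb).sum()                # block-triangular Jacobian: log|det| = sum of log-scales
    return np.concatenate([ya, xb]), log_det

def coupling_inverse(y):
    ya, yb = y[:d], y[d:]
    xa = (ya - t(yb)) * np.exp(-s(yb))   # x^B = y^B, so the conditioner can be recomputed exactly
    return np.concatenate([xa, yb])

x = rng.standard_normal(D)
y, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)
```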

3.4.2 Autoregressive Flows

\[y_t = h(x_t; \Theta_t(x_{1:t-1})) \tag{18}\]

when the conditioner depends on the previous inputs $x_{1:t-1}$ (the direct autoregressive form), and

\[y_t = h(x_t; \theta_t(y_{1:t-1})) \tag{20}\]

when it depends on the previously generated outputs $y_{1:t-1}$ (the inverse autoregressive form).
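A sketch of the direct autoregressive form with an affine $h$ and a toy linear conditioner of my own: evaluating $y$ from $x$ is parallel over $t$, while the inverse is inherently sequential.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4
# Toy conditioner Theta_t(x_{1:t-1}): strictly lower-triangular matrices guarantee that
# the log-scale alpha_t and shift mu_t depend only on x_1, ..., x_{t-1}.
V = np.tril(rng.standard_normal((D, D)) * 0.3, k=-1)
M = np.tril(rng.standard_normal((D, D)) * 0.3, k=-1)

def ar_forward(x):
    alpha, mu = np.tanh(V @ x), M @ x
    y = x * np.exp(alpha) + mu           # y_t = h(x_t; Theta_t(x_{1:t-1})), all t in parallel
    return y, alpha.sum()                # triangular Jacobian: log|det| = sum_t alpha_t

def ar_inverse(y):
    x = np.zeros(D)
    for t in range(D):                   # inversion is inherently sequential in t
        alpha_t, mu_t = np.tanh(V[t] @ x), M[t] @ x
        x[t] = (y[t] - mu_t) * np.exp(-alpha_t)
    return x

x = rng.standard_normal(D)
y, log_det = ar_forward(x)
assert np.allclose(ar_inverse(y), x)
```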


3.4.3 Universality

Universality means that the flow can learn any target density to any required precision given sufficient capacity and data

3.4.4 Coupling Functions

  1. affine coupling
  2. nonlinear squared flow
  3. continuous mixture CDFs
  4. splines
  5. neural autoregressive flow
  6. sum-of-squares polynomial flow
  7. piecewise-bijective coupling

3.5 Residual Flows

residual networks are compositions of functions of the form

\[g(x) = x + F(x)\]

Such a function is called a residual connection, and the residual block $F(\cdot)$ is a feed-forward NN of any kind

attempts to build a reversible network architecture based on residual connections: RevNets and iRevNets
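A sketch (my own toy residual block) of the idea behind invertible residual networks: if $\mathrm{Lip}(F) < 1$, then $g(x) = x + F(x)$ is invertible and the inverse can be computed by the fixed-point iteration $x \leftarrow y - F(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
W = rng.standard_normal((D, D))
W *= 0.9 / np.linalg.norm(W, 2)          # rescale so the block is a contraction (Lip(F) <= 0.9)
F = lambda x: np.tanh(W @ x)             # residual block F; tanh is 1-Lipschitz

def residual_forward(x):
    return x + F(x)                      # g(x) = x + F(x)

def residual_inverse(y, n_iter=200):
    x = y.copy()                         # fixed-point iteration x <- y - F(x), a Banach contraction
    for _ in range(n_iter):
        x = y - F(x)
    return x

x = rng.standard_normal(D)
assert np.allclose(residual_inverse(residual_forward(x)), x)
```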

3.6 Infinitesimal (Continuous) Flows

3.6.1 ODE-based methods

a first order ODE

\[\frac{d}{dt}x(t) = F(x(t), \theta(t))\]
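A crude Euler-integration sketch (my own, with a linear $F$ so the trace term is exact): in the continuous setting the log-density evolves as $\frac{d}{dt}\log p(x(t)) = -\mathrm{Tr}\, DF(x(t))$, so the state and the log-density change can be integrated jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2
A = rng.standard_normal((D, D)) * 0.5
F = lambda x, t: A @ x                   # simple linear, time-independent dynamics F(x(t), theta(t))

def integrate(x0, t1=1.0, n_steps=1000):
    # Euler integration of dx/dt = F(x, t) jointly with the log-density change
    # d log p(x(t))/dt = -Tr(DF(x(t))); for linear F the trace is just Tr(A).
    x, delta_logp, dt = x0.copy(), 0.0, t1 / n_steps
    for k in range(n_steps):
        x = x + dt * F(x, k * dt)
        delta_logp -= dt * np.trace(A)
    return x, delta_logp

x1, delta_logp = integrate(rng.standard_normal(D))
# log p_1(x(1)) = log p_0(x(0)) + delta_logp, up to the Euler discretization error.
print(x1, delta_logp)
```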

3.6.2 SDE-based methods (Langevin flows)

start with a complicated and irregular data distribution, and then mix it to produce the simple base distribution $p_Z(z)$

if the mixing obeys certain rules, then this procedure can be invertible.

A stochastic differential equation (SDE) or Itô process describes a change of a random variable $x\in \IR^D$ as a function of time $t\in \IR_+$

\[dx(t) = b(x(t), t)dt + \sigma(x(t), t) dB_t\]

where

  • $b(x, t)$: the drift coefficient
  • $\sigma(x, t)\in \IR^{D\times D}$: diffusion coefficient
  • $B_t$: $D$-dimensional Brownian motion
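A small Euler–Maruyama sketch of simulating such an SDE forward in time; the Ornstein–Uhlenbeck-style drift and diffusion below are my own toy choice, which mixes any starting distribution toward a standard normal base.

```python
import numpy as np

rng = np.random.default_rng(0)
b = lambda x, t: -x                      # drift coefficient b(x, t)
sigma = lambda x, t: np.sqrt(2.0)        # diffusion coefficient sigma(x, t) (scalar here)

def euler_maruyama(x0, t1=8.0, n_steps=800):
    # Simulate dx = b(x, t) dt + sigma(x, t) dB_t; this Ornstein-Uhlenbeck process mixes
    # any starting distribution toward the N(0, I) base (the "normalizing" direction).
    x, dt = x0.copy(), t1 / n_steps
    for k in range(n_steps):
        x = x + b(x, k * dt) * dt + sigma(x, k * dt) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

samples = euler_maruyama(10.0 + rng.standard_normal((10_000, 2)))   # "data" far from the base
print(samples.mean(axis=0), samples.var(axis=0))                    # ≈ 0 and ≈ 1 after mixing
```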

4 Datasets and Performance

  • Tabular datasets
  • Image datasets

5. Discussion and Open Problems

5.1 Inductive biases

  • Role of the base measure
  • Form of diffeomorphisms
  • Loss function

5.2 Generalization to non-Euclidean spaces

  • Flows on manifolds
  • Discrete distributions
