Review on Normalizing Flows
Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact.
a major goal: model a probability distribution given samples drawn from that distribution
this is
- an example of unsupervised learning
- sometimes called generative modelling
applications:
- density estimation
- outlier detection
- prior construction
- dataset summarization
many methods for generative modelling:
- direct analytic approaches approximate observed data with a fixed family of distributions
- variational approaches and expectation maximization introduce latent variables to explain the observed data
- additional flexibility but can increase the complexity of learning and inference
- graphical models explicitly model the conditional dependence between random variables
- generative neural approaches: generative adversarial networks (GANs) and variational auto-encoders (VAEs)
- neither allows for exact evaluation of the probability density of new points
- training can be challenging due to a variety of phenomena including mode collapse, posterior collapse, vanishing gradients and training instability
normalizing flows (NF) are a family of generative models with tractable distributions where both sampling and density evaluation can be efficient and exact.
Background
A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings.
The density of a sample can be evaluated by transforming it back to the original simple distribution and then computing the product of
- the density of the inverse-transformed sample under this distribution
- the associated change in volume induced by the sequence of inverse transformations
the result of this approach is a mechanism to construct new families of distributions by choosing an initial density and then chaining together some number of parameterized, invertible, and differentiable transformations
Basics
- $Z \in \IR^D$: r.v. with a known and tractable pdf $p_Z: \IR^D \rightarrow \IR$
- $g$: an invertible function, $Y = g(Z)$
the new density is called a pushforward of the density $p_Z$ by the function $g$ and denoted by $g\star p_Z$
- generative direction: from base density to final complicated density, $y = g(z)$
- normalizing direction: the inverse function $f = g^{-1}$ moves (or flows) in the opposite direction, from the complicated density back to the base density
the term is doubly accurate if the base measure $p_Z$ is chosen as a Normal distribution, as it often is in practice
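written out, the pushforward density is given by the change-of-variables formula, i.e. exactly the two factors described in the Background (the base density at the inverse-transformed point and the change in volume induced by $f$):
\[p_Y(y) = p_Z(f(y))\,\vert \det Df(y)\vert\]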
normalizing flows: bijections which are convenient to compute, to invert, and whose Jacobian determinants are easy to calculate.
the composition of invertible functions is itself invertible and the determinant of its Jacobian has a specific form.
Let $g_1,\ldots, g_N$ be a set of $N$ bijective functions and define
\[g = g_N\circ g_{N-1}\circ \cdots \circ g_1\]to be the composition of the functions
then $g$ is also bijective, with inverse
\[f = f_1\circ \cdots \circ f_{N-1}\circ f_N\]and the determinant of the Jacobian is
\[\det Df(y) = \prod_{i=1}^N \det Df_i(x_i)\]where the value of the $i$-th intermediate flow is denoted
\[x_i = g_i\circ \cdots \circ g_1(z) = f_{i+1} \circ \cdots \circ f_N(y),\]so that $x_N = y$ and $x_0 = z$
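as a concrete sketch of this composition, the log-density of a sample can be computed by walking through the flows in the normalizing direction and accumulating log-determinants; the `inverse` / `inv_log_abs_det_jacobian` interface below is a hypothetical convention (not from any particular library), and a standard normal base is assumed:

```python
import numpy as np

def log_density(y, flows):
    """log p_Y(y) for a composed flow g = g_N o ... o g_1.

    Assumes each flow exposes two hypothetical methods:
      inverse(x)                  -- applies f_i = g_i^{-1}
      inv_log_abs_det_jacobian(x) -- returns log |det Df_i(x)| at the input x
    and that the base density p_Z is a standard normal.
    """
    x = np.asarray(y, dtype=float)          # x_N = y
    log_det = 0.0
    for flow in reversed(flows):            # apply f_N first, then f_{N-1}, ..., f_1
        log_det = log_det + flow.inv_log_abs_det_jacobian(x)
        x = flow.inverse(x)                 # x_i -> x_{i-1}
    D = x.shape[-1]
    log_pz = -0.5 * np.sum(x * x, axis=-1) - 0.5 * D * np.log(2.0 * np.pi)
    return log_pz + log_det
```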
Applications
Density estimation and sampling
assume that
- only a single flow, $g$, is used and is parameterized by the vector $\theta$
- the base measure $p_Z$ is given and is parameterized by the vector $\phi$
given a set of data observed from some complicated distribution, $\cD = \{y^{(i)}\}_{i=1}^M$, we can then perform likelihood-based estimation of the parameters $\Theta = (\theta, \phi)$; the data log-likelihood becomes
\[\log p(\cD\mid \Theta) = \sum_{i=1}^M \log p_Y(y^{(i)}\mid \Theta) = \sum_{i=1}^M \left[\log p_Z(f(y^{(i)}\mid \theta)\mid \phi) + \log \vert \det Df(y^{(i)}\mid \theta)\vert\right]\]While MLE is often effective, other forms of estimation can be and have been used with normalizing flows. In particular, adversarial losses can be used with normalizing flow models (e.g., Flow-GAN)
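under these assumptions, maximum likelihood estimation amounts to minimizing the negative of the sum above; a minimal sketch reusing the hypothetical `log_density` helper from the Basics section (in practice the minimization would be done with a gradient-based optimizer in an autodiff framework):

```python
def negative_log_likelihood(data, flows):
    """-log p(D | Theta) = -sum_i [ log p_Z(f(y_i)) + log |det Df(y_i)| ]."""
    data = np.asarray(data, dtype=float)    # shape (M, D)
    return -np.sum(log_density(data, flows))
```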
Variational Inference
use variational inference and introduce the approximate posterior $q(y\mid x, \theta)$
this is done by minimizing the KL divergence $D_{KL}(q(y\mid x, \theta)\Vert p(y\mid x))$
reparameterize $q(y\mid x, \theta) = p_Y(y\mid \theta)$ with normalizing flows.
assume that only a single flow $g$ with parameters $\theta$ is used, $y = g(z\mid \theta)$ and the base distribution $p_Z(z)$ does not depend on $\theta$, then
\[\bbE_{p_Y(y\mid \theta)}[h(y)] = \bbE_{p_Z(z)}[h(g(z\mid\theta))]\]which is the reparameterization trick
in this scenario, evaluating the likelihood is only required at points which have been sampled
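a minimal sketch of this reparameterized expectation, estimated by Monte Carlo; `flow_g` stands for any hypothetical callable implementing the generative direction $y = g(z\mid\theta)$:

```python
def expectation_reparam(h, flow_g, num_samples=1000, dim=2, seed=0):
    """Monte Carlo estimate of E_{p_Y(y|theta)}[h(y)] via the reparameterization
    trick: sample z ~ p_Z = N(0, I), push through y = g(z|theta), average h(y)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, dim))   # samples from the base distribution
    y = flow_g(z)                                 # generative direction
    return np.mean(h(y), axis=0)
```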
3. Methods
3.1 Elementwise Flows
a basic form of bijective non-linearity can be constructed given any bijective scalar function.
that is, let $h:\IR\rightarrow\IR$ be a scalar valued bijection, then if $x = (x_1,\ldots, x_D)^T$,
\[g(x) = (h(x_1), h(x_2), \ldots, h(x_D))^T\]is also a bijection whose inverse simply requires computing $h^{-1}$ elementwise and whose Jacobian determinant is, in absolute value, the product of the absolute values of the derivatives of $h$
in deep learning terminology, $h$ could be viewed as an “activation function”
note that the most commonly used activation function, ReLU, is not bijective and so cannot be used directly, but the (Parametric) Leaky ReLU can be used instead
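a minimal sketch of such an elementwise flow built from a Leaky ReLU with slope $\alpha > 0$ on negative inputs, following the hypothetical interface used in the earlier `log_density` sketch:

```python
class LeakyReLUFlow:
    """Elementwise bijection h(x) = x for x >= 0 and alpha * x otherwise."""
    def __init__(self, alpha=0.5):
        assert alpha > 0, "alpha must be positive for h to be bijective"
        self.alpha = alpha

    def forward(self, x):                    # generative direction g
        return np.where(x >= 0, x, self.alpha * x)

    def inverse(self, y):                    # normalizing direction f = g^{-1}
        return np.where(y >= 0, y, y / self.alpha)

    def inv_log_abs_det_jacobian(self, y):
        # the Jacobian of the elementwise inverse is diagonal, so its log |det|
        # is the sum of the log absolute derivatives
        deriv = np.where(y >= 0, 1.0, 1.0 / self.alpha)
        return np.sum(np.log(deriv), axis=-1)
```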
3.2 Linear Flows
Elementwise operations alone are insufficient as they cannot express any form of correlation between dimensions
\[g(x) = Ax + b\]
3.2.1 Diagonal
reduce to elementwise
3.2.2 Triangular
a triangular matrix is a more expressive form of linear transformation whose determinant is the product of its diagonal entries
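a minimal sketch of a lower-triangular linear flow with this determinant structure (again using the hypothetical interface from the earlier sketches; a dedicated triangular solver would be used in practice):

```python
class TriangularFlow:
    """Linear flow y = L x + b with L lower triangular, so det(L) = prod(diag(L))."""
    def __init__(self, L, b):
        self.L = np.tril(np.asarray(L, dtype=float))   # enforce the triangular structure
        self.b = np.asarray(b, dtype=float)

    def forward(self, x):
        return x @ self.L.T + self.b

    def inverse(self, y):
        # solve L x = y - b; forward substitution would exploit the triangular form
        return np.linalg.solve(self.L, (y - self.b).T).T

    def inv_log_abs_det_jacobian(self, y):
        # log |det Df| = -log |det Dg| = -sum(log |diag(L)|)
        return -np.sum(np.log(np.abs(np.diag(self.L))))
```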
3.2.3 Permutation and Orthogonal
orthogonal matrices parameterized by the Householder transformation
fact: any orthogonal matrix can be written as a product of reflections
3.2.4 Factorizations
\[g(x) = PLUx + b\]where $P$ is a permutation matrix, $L$ is lower triangular, and $U$ is upper triangular
3.2.5 Convolution
Hoogeboom et al. (2019): a more general solution for modelling $d\times d$ convolutions,
- either by stacking together masked autoregressive convolutions (referred to as Emerging Convolutions)
- or by exploiting the Fourier domain representation of convolution (referred to as Periodic Convolutions)
3.3 Planar and Radial Flows
3.3.1 Planar Flows
expand and contract the distributions along certain specific directions and take the form
\[g(x) = x + u\,h(w^Tx + b)\]
3.3.2 Radial Flows
modify the distribution around a specific point $x_0$ so that
\[g(x) = x + \frac{\beta}{\alpha + \Vert x - x_0\Vert} (x - x_0)\]
3.4 Coupling and Autoregressive Flows
3.4.1 Coupling Flows
consider a disjoint partition of the input $x\in \IR^D$ into two subspaces: $(x^A, x^B) \in \IR^d \times \IR^{D-d}$ and a bijection $h(\cdot;\theta): \IR^d\rightarrow\IR^d$
\[y^A = h(x^A; \Theta(x^B)),\qquad y^B = x^B\]
3.4.2 Autoregressive Flows
\[y_t = h(x_t; \Theta_t(x_{1:t-1}))\]or, with the conditioner depending on the previous outputs instead,
\[y_t = h(x_t; \Theta_t(y_{1:t-1}))\]
3.4.3 Universality
Universality means that the flow can learn any target density to any required precision given sufficient capacity and data
3.4.4 Coupling Functions
- affine coupling (see the sketch after this list)
- nonlinear squared flow
- continuous mixture CDFs
- splines
- neural autoregressive flow
- sum-of-squares polynomial flow
- piecewise-bijective coupling
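a minimal sketch of the first coupling function in the list, an affine (RealNVP-style) coupling layer; `conditioner` is a hypothetical placeholder for any function (e.g., a small neural network) mapping $x^B$ to a log-scale $s$ and shift $t$:

```python
class AffineCoupling:
    """Coupling flow with affine coupling function:
    y_A = x_A * exp(s(x_B)) + t(x_B),  y_B = x_B."""
    def __init__(self, d, conditioner):
        self.d = d                          # dimension of the transformed block x_A
        self.conditioner = conditioner      # hypothetical: x_B -> (s, t)

    def forward(self, x):
        x_A, x_B = x[..., :self.d], x[..., self.d:]
        s, t = self.conditioner(x_B)
        return np.concatenate([x_A * np.exp(s) + t, x_B], axis=-1)

    def inverse(self, y):
        y_A, y_B = y[..., :self.d], y[..., self.d:]
        s, t = self.conditioner(y_B)        # y_B equals x_B, so s and t are recoverable
        return np.concatenate([(y_A - t) * np.exp(-s), y_B], axis=-1)

    def inv_log_abs_det_jacobian(self, y):
        s, _ = self.conditioner(y[..., self.d:])
        return -np.sum(s, axis=-1)          # the Jacobian is triangular with diag (exp(s), 1)
```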
3.5 Residual Flows
residual networks are compositions of functions of the form
\[g(x) = x + F(x)\]Such a function is called a residual connection, and the residual block $F(\cdot)$ is a feed-forward NN of any kind
attempts to build a reversible network architecture based on residual connections: RevNets and iRevNets
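a related construction (i-ResNet-style residual flows) makes the residual connection invertible by constraining $F$ to be a contraction (Lipschitz constant below 1), in which case the inverse can be approximated by fixed-point iteration; a minimal sketch with a hypothetical residual block `F`:

```python
def invert_residual(y, F, num_iters=50):
    """Invert y = x + F(x) by the fixed-point iteration x <- y - F(x),
    which converges when F is a contraction (Lipschitz constant < 1)."""
    x = np.copy(y)                  # initial guess x_0 = y
    for _ in range(num_iters):
        x = y - F(x)
    return x
```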
3.6 Infinitesimal (Continuous) Flows
3.6.1 ODE-based methods
a first order ODE
\[\frac{d}{dt}x(t) = F(x(t), \theta(t))\]
3.6.2 SDE-based methods (Langevin flows)
start with a complicated and irregular data distribution, and then mix it to produce the simple base distribution $p_Z(z)$
if the mixing obeys certain rules, then this procedure can be invertible.
A stochastic differential equation (SDE) or Itô process describes a change of a random variable $x\in \IR^D$ as a function of time $t\in \IR_+$
\[dx(t) = b(x(t), t)dt + \sigma(x(t), t) dB_t\]where
- $b(x, t)$: the drift coefficient
- $\sigma(x, t)\in \IR^{D\times D}$: the diffusion coefficient
- $B_t$: $D$-dimensional Brownian motion
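as an illustration, such an SDE can be simulated with the Euler-Maruyama scheme; `b` and `sigma` below are hypothetical placeholders for the drift and diffusion coefficients:

```python
def euler_maruyama(x0, b, sigma, t_max=1.0, num_steps=100, seed=0):
    """Simulate dx = b(x, t) dt + sigma(x, t) dB_t on [0, t_max].
    b(x, t) returns a vector of shape (D,); sigma(x, t) a matrix of shape (D, D)."""
    rng = np.random.default_rng(seed)
    dt = t_max / num_steps
    x, t = np.asarray(x0, dtype=float), 0.0
    for _ in range(num_steps):
        dB = np.sqrt(dt) * rng.standard_normal(x.shape)   # Brownian increment
        x = x + b(x, t) * dt + sigma(x, t) @ dB
        t += dt
    return x
```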
4. Datasets and Performance
- Tabular datasets
- Image datasets
5. Discussion and Open Problems
5.1 Inductive biases
- Role of the base measure
- Form of diffeomorphisms
- Loss function
5.2 Generalization to non-Euclidean spaces
- Flows on manifolds
- Discrete distributions