UMAP: Uniform Manifold Approximation and Projection
This note is for McInnes, L., Healy, J., & Melville, J. (2020). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology.
dimension reduction algorithms tend to fall into two categories:
- preserve the pairwise distance structure amongst all the data samples: PCA, MDS, Sammon mapping
- favor the preservation of local distances over global distances: t-SNE, Isomap, LargeVis, Laplacian eigenmaps, diffusion maps
2 Theoretical Foundations for UMAP
2.1 Uniform distribution of data on a manifold and geodesic approximation
the first step is to approximate the manifold on which the paper assumes the data (approximately) lies
3 A Computational View of UMAP
UMAP can ultimately be described in terms of the construction of, and operations on, weighted graphs.
this situates UMAP in the class of k-neighbor based graph learning algorithms such as Laplacian Eigenmaps, Isomap and t-SNE
as with other k-neighbor graph based algorithms, UMAP can be described in two phases.
in the first phase a particular weighted k-neighbor graph is constructed
in the second phase a low dimensional layout of this graph is computed
UMAP assumes the following axioms:
- there exists a manifold on which the data would be uniformly distributed
- the underlying manifold of interest is locally connected
- preserving the topological structure of this manifold is the primary goal
any algorithm that attempts to use a mathematical structure akin to a k-neighbor graph to approximate a manifold must follow a similar basic structure
- Graph Construction
- Construct a weighted k-neighbor graph
- apply some transform to the edges based on the ambient local distances
- deal with the inherent asymmetry of the k-neighbor graph
- Graph Layout
- define an objective function that preserves desired characteristics of this k-neighbor graph
- find a low dimensional representation which optimizes this objective function
3.1 Graph Construction
- $X = \{x_1,\ldots, x_N\}$: the input dataset
- metric (dissimilarity measure) $d: X\times X\rightarrow \mathbb{R}_{\ge 0}$
- $\{x_{i_1},\ldots, x_{i_k}\}$: the $k$ nearest neighbors of $x_i$ under the metric $d$
for each $x_i$, define $\rho_i$ and $\sigma_i$: let
\[\rho_i = \min\{d(x_i, x_{i_j})\mid 1\le j\le k,\ d(x_i, x_{i_j}) > 0\}\]
and set $\sigma_i$ to be the value such that
\[\sum_{j=1}^k \exp\left(\frac{-\max(0, d(x_i, x_{i_j}) - \rho_i )}{\sigma_i}\right) = \log_2(k)\]
define a weighted directed graph $\bar G = (V, E, w)$, where the weight function is
\[w((x_i, x_{i_j})) = \exp\left(\frac{-\max(0, d(x_i, x_{i_j}) - \rho_i )}{\sigma_i}\right)\]
let $A$ be the weighted adjacency matrix of $\bar G$, and consider the symmetric matrix
\[B = A + A^\top - A \circ A^\top\]
where $\circ$ is the Hadamard (or pointwise) product
if one interprets the value of $A_{ij}$ as the probability that the directed edge from $x_i$ to $x_j$ exists, then $B_{ij}$ is the probability that at least one of the two directed edges (from $x_i$ to $x_j$ and from $x_j$ to $x_i$) exists.
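if the two directed edges are treated as independent events, this is just inclusion-exclusion:
\[B_{ij} = A_{ij} + A_{ji} - A_{ij}A_{ji} = P(e_{ij}) + P(e_{ji}) - P(e_{ij})P(e_{ji})\]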
The UMAP graph $G$ is then an undirected weighted graph whose adjacency matrix is given by $B$
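a minimal NumPy sketch of this construction, assuming an exact kNN query via scikit-learn's NearestNeighbors (the reference implementation uses approximate nearest neighbor descent); the function name, bisection scheme, and iteration count here are illustrative, not from the paper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fuzzy_knn_graph(X, k=15, n_iter=64):
    """Sketch of the graph construction in Section 3.1: returns B."""
    n = X.shape[0]
    # k nearest neighbors of each point (drop the point itself)
    dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]

    A = np.zeros((n, n))
    for i in range(n):
        d = dists[i]
        rho = d[d > 0].min()  # distance to nearest neighbor at positive distance
        # bisection for sigma_i: smoothed weights must sum to log2(k)
        target, lo, hi, sigma = np.log2(k), 0.0, np.inf, 1.0
        for _ in range(n_iter):
            s = np.exp(-np.maximum(0.0, d - rho) / sigma).sum()
            if s > target:
                hi = sigma
                sigma = (lo + hi) / 2.0
            else:
                lo = sigma
                sigma = sigma * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        A[i, idx[i]] = np.exp(-np.maximum(0.0, d - rho) / sigma)

    # fuzzy set union: B_ij = probability that at least one directed edge exists
    return A + A.T - A * A.T
```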
3.2 Graph Layout
UMAP uses a force directed graph layout algorithm in low dimensional space
any force directed layout algorithm requires a description of both attractive and repulsive forces. the algorithm proceeds by iteratively applying attractive and repulsive forces at each edge or vertex; a sketch of one such update follows the input list below
the inputs are:
- $X$: the dataset to have its dimension reduced
- $n$: the neighborhood size to use for local metric approximation
- $d$: the dimension of the target reduced space
- min-dist: an algorithmic parameter controlling the layout
- n-epochs: controls the amount of optimization work to perform
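a minimal sketch of one possible SGD loop for this phase, using the attractive and repulsive force expressions from the paper under the low dimensional similarity $\Phi(\lVert y_i - y_j\rVert) = (1 + a\lVert y_i - y_j\rVert^{2b})^{-1}$; sampling each edge with probability equal to its weight is a simplification of the paper's epochs-per-sample bookkeeping, and the values of $a$ and $b$ (normally fit from min-dist) are illustrative:

```python
import numpy as np

def optimize_layout(Y, edges, weights, n_epochs=200,
                    a=1.58, b=0.9, lr=1.0, n_neg=5, eps=1e-3, seed=0):
    """Sketch of the layout phase: attraction along graph edges,
    repulsion against randomly sampled (negative) vertices."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    for epoch in range(n_epochs):
        alpha = lr * (1.0 - epoch / n_epochs)  # linearly decaying step size
        for (i, j), w in zip(edges, weights):
            if rng.random() > w:  # simplification: sample edge w.p. = weight
                continue
            diff = Y[i] - Y[j]
            d2 = diff @ diff
            if d2 > 0.0:
                # attractive force from the cross-entropy objective
                att = (-2.0 * a * b * d2 ** (b - 1.0)) / (1.0 + a * d2 ** b)
                Y[i] += alpha * att * diff
                Y[j] -= alpha * att * diff
            # negative sampling: push y_i away from random vertices
            for l in rng.integers(0, n, n_neg):
                diff = Y[i] - Y[l]
                d2 = diff @ diff
                rep = (2.0 * b) / ((eps + d2) * (1.0 + a * d2 ** b))
                Y[i] += alpha * rep * diff
    return Y
```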
5 Practical Efficacy
- Pen digits
- COIL 20
- COIL 100
- Mouse scRNA-seq
- Statlog (Shuttle)
- MNIST
- F-MNIST
- Flow cytometry
- GoogleNews word vectors
5.1 Qualitative Comparison of Multiple Algorithms
5.2 Quantitative Comparison of Multiple Algorithms
the paper compares the accuracy of a kNN classifier trained on the embedding space for a variety of datasets
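a minimal sketch of this kind of evaluation on a small dataset (the paper's exact protocol, splits, and choice of k are not recorded in this note; the umap-learn package is assumed):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import umap  # the umap-learn package

# embed the whole dataset, then score a kNN classifier on the embedding
X, y = load_digits(return_X_y=True)
emb = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(emb, y, random_state=0)
acc = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr).score(X_te, y_te)
print(f"kNN accuracy on the embedding: {acc:.3f}")
```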
5.3 Embedding Stability
since UMAP makes use of both stochastic approximate nearest neighbor search and stochastic gradient descent with negative sampling, the resulting embedding necessarily differs from run to run, and under sub-sampling of the data.
to measure the stability of an embedding, the paper uses the normalized Procrustes distance between two (potentially comparable) embeddings, after aligning them by translation, rotation, and scaling.
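a sketch of such a stability check using scipy's Procrustes analysis (which performs exactly that alignment; its disparity is normalized slightly differently from the paper's distance, and the random data here is purely illustrative):

```python
import numpy as np
from scipy.spatial import procrustes
import umap

# two runs with different seeds on the same (illustrative) data
X = np.random.default_rng(0).normal(size=(500, 20))
emb1 = umap.UMAP(random_state=1).fit_transform(X)
emb2 = umap.UMAP(random_state=2).fit_transform(X)

# procrustes removes translation, scaling, and rotation, then reports a
# normalized sum-of-squares disparity between the aligned embeddings
_, _, disparity = procrustes(emb1, emb2)
print(f"Procrustes disparity between runs: {disparity:.4f}")
```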
5.4 Computational Performance Comparisons
- scaling with embedding dimension
- scaling with ambient dimension
- scaling with the number of samples
6 Weaknesses
- like most non-linear dimension reduction techniques (including t-SNE and Isomap), UMAP lacks the strong interpretability of PCA and related techniques such as NMF
- one of the core assumptions of UMAP is that there exists manifold structure in the data; because of this it may find manifold structure within noise, so care must be taken with smaller sample sizes of noisy data
- UMAP is derived from the axiom that local distance is of more importance than long range distances (similar to techniques like t-SNE and LargeVis); while the paper believes UMAP can capture more global structure than these techniques, if global structure is of primary interest then UMAP may not be the best choice