
Finite Mixture Models

Tags: Mixture Model, Shape-constrained

This note is for Wang, H., Ibrahim, S., & Mazumder, R. (2023). Nonparametric Finite Mixture Models with Possible Shape Constraints: A Cubic Newton Approach (No. arXiv:2107.08535). arXiv. https://doi.org/10.48550/arXiv.2107.08535

the paper considers computational aspects of the MLE of the mixture proportions of a nonparametric finite mixture model, a convex optimization problem with old roots in statistics and a key member of the modern data analysis toolkit

motivated by problems in shape constrained inference, they consider structured variants of this problem with additional convex polyhedral constraints

they propose a new cubic regularized Newton method for this problem and present novel worst-case and local computational guarantees for the algorithm

extend earlier work by Nesterov and Polyak to the case of a self-concordant objective with polyhedral constraints

propose a Frank-Wolfe method to solve the cubic regularized Newton subproblem
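as a rough illustration of the subproblem structure, below is a generic Frank-Wolfe loop for minimizing a cubic-regularized quadratic model over the simplex; this is only a sketch under standard Frank-Wolfe conventions (open-loop step size, vertex linear minimization oracle), not the paper's algorithm, and the function name and parameters are made up

```python
# minimal sketch (not the paper's subroutine): Frank-Wolfe on the model
#   m(d) = g'd + 0.5 d'Hd + (L/6)||d||^3   over   w_k + d in Delta_M
import numpy as np

def frank_wolfe_cubic_model(g, H, w_k, L, n_iters=200):
    """Approximately minimize the cubic-regularized quadratic model over the simplex."""
    M = len(g)
    w = w_k.copy()
    for t in range(n_iters):
        d = w - w_k
        grad = g + H @ d + 0.5 * L * np.linalg.norm(d) * d  # gradient of the model at w
        v = np.zeros(M)
        v[np.argmin(grad)] = 1.0          # linear minimization over Delta_M returns a vertex
        gamma = 2.0 / (t + 2.0)           # standard open-loop step size
        w = (1.0 - gamma) * w + gamma * v
    return w
```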

in the particular case of Gaussian mixtures without shape constraints, they derive bounds on how well the finite mixture problem approximates the infinite-dimensional Kiefer-Wolfowitz maximum likelihood estimator

Introduction

consider a one-dimensional mixture density of the form $g(x) = \sum_{i=1}^M w_i\psi_i(x)$, where the $M$ components (probability densities) $\{\psi_i(x)\}_1^M$ are known, but the proportions $\{w_i\}_1^M$ are unknown and need to be estimated from the data at hand

Learning nonparametric mixture proportions

Suppose we are given $N$ samples $\{X_j\}_1^N$, drawn independently from $g(x)$; one can then obtain an MLE of the mixture proportions via the following convex optimization problem

\[\min_{w} \; -\frac 1N\sum_{j\in [N]}\log \Big(\sum_{i\in [M]}w_iB_{ij}\Big) \quad \text{s.t.} \quad w\in \Delta_M := \Big\{w: \sum_{i\in [M]}w_i=1,\; w_i\ge 0,\; i\in [M]\Big\}\]

where $B_{ij} = \psi_i(X_j)$
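
as a quick illustration of the finite-dimensional problem, here is a minimal sketch that builds $B$ for a toy set of known components and solves the simplex-constrained MLE with an off-the-shelf solver (scipy's SLSQP); this is just a generic baseline, not the cubic Newton method proposed in the paper, and the data and component choices are invented for illustration

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 1, 700), rng.normal(1, 1, 300)])   # N samples from g
components = [norm(loc=m, scale=1.0) for m in (-3, -2, -1, 0, 1, 2)]  # known psi_i
B = np.array([c.pdf(X) for c in components])                          # B[i, j] = psi_i(X_j)

def negloglik(w):
    # -(1/N) * sum_j log( sum_i w_i * B_ij )
    return -np.mean(np.log(w @ B + 1e-300))

M = B.shape[0]
res = minimize(negloglik, x0=np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w_hat = res.x  # estimated mixture proportions on the simplex
```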

one of the goals is to present new algorithms, with associated computational guarantees, that scale to instances with $N\approx 10^6$ and $M\approx 10^3$

the problem has a rich history in statistics:

  • it arises, for example, in the context of empirical Bayes estimation

several choices of the basis functions $\{\psi_i\}$ are possible

a notable example, arising in Gaussian sequence models, is to consider $\psi_i(x) = \varphi(x-\mu_i)$, where $\varphi$ is the standard Gaussian density. Then the problem can be interpreted as a finite-dimensional approximation of the original formulation of the Kiefer-Wolfowitz nonparametric MLE

given an infinite mixture $g_Q(x) = \int_\IR\varphi(x - \mu)\, dQ(\mu)$, where $Q$ is a probability distribution on $\IR$, the Kiefer-Wolfowitz nonparametric MLE estimates $Q$ given $N$ independent samples from $g_Q$

this infinite-dimensional problem admits a finite-dimensional solution, with $Q$ supported on at most $N$ unknown atoms

since these atoms can be hard to locate, it is common to resort to discrete approximations over a suitable pre-specified grid of $\{\mu_i\}$-values
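
a minimal sketch of this discretization step (the equispaced grid over the data range is an illustrative heuristic, not a prescription from the paper):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(rng.choice([-2.0, 1.5], size=5000), 1.0)  # N samples from g_Q
mu_grid = np.linspace(X.min(), X.max(), 300)             # M pre-specified atoms for Q
B = norm.pdf(X[None, :] - mu_grid[:, None])               # B[i, j] = phi(X_j - mu_i)
# B can now be fed to the same simplex-constrained MLE as above
```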

the paper presents bounds that quantify how well an optimal solution to a discretized problem (with a-priori specified atoms) approximates a solution to the infinite-dimensional Kiefer-Wolfowitz MLE problem.

versions of this problem originated in the 1890s; the utility of this estimator and the abundance of large-scale datasets have necessitated solving it at scale (e.g., with $N \approx 10^5$ to $10^6$ and $M\approx 10^3$)

a popular approach is based on the EM algorithm, which is known to exhibit slow convergence on this problem
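
for fixed components, the EM update for the proportions reduces to a simple multiplicative rule; a minimal sketch assuming the $(M, N)$ matrix $B$ from above:

```python
import numpy as np

def em_proportions(B, n_iters=500):
    M, N = B.shape
    w = np.full(M, 1.0 / M)
    for _ in range(n_iters):
        denom = w @ B                      # mixture density at each sample, shape (N,)
        w = w * (B / denom).mean(axis=1)   # w_i <- (1/N) sum_j w_i B_ij / sum_k w_k B_kj
    return w
```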

  • solving a dual via Mosek’s interior point solver led to improvements over EM-based methods
  • an approach based on sequential quadratic programming that offers notable improvements
  • low-rank approximations of $B$, active set updates, among other clever tricks to obtain computational improvements

global complexity guarantees of these methods appear to be unknown

the paper presents a novel computational framework (with global complexity guarantees) for both this problem and a constrained variant of it

Learning nonparametric mixtures with shape constraints

consider a structured variant which allows for density estimation under shape constraints.

assume the density $x\mapsto \sum_{i\in [M]}w_i\psi_i(x)$ obeys a pre-specified shape constraint such as monotonicity, convexity, or unimodality

consider a special family of basis functions, the Bernstein polynomial bases, which are well-known for their

  • appealing shape-preserving properties
  • ability to approximate any smooth function over a compact support to high accuracy

on the interval $[0, 1]$, the Bernstein polynomials of degree $M$ are given by linear combinations of the following Bernstein basis elements

\[\tilde b_m(x) = \begin{cases} \frac{\Gamma(M+1)}{\Gamma(m)\Gamma(M-m+1)} x^{m-1}(1-x)^{M-m} & x\in [0,1]\\ 0 & \text{otherwise} \end{cases}\]

note that $\tilde b_m(x)$ is a Beta-density with shape parameters $m$ and $M-m+1$; hence $g_B(x) := \sum_{m\in [M]} w_m \tilde b_m(x)$ can be interpreted as a finite mixture of Beta-densities.
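
a small sketch of the basis construction via scipy's Beta distribution (the function name and grid are illustrative):

```python
import numpy as np
from scipy.stats import beta

def bernstein_basis_matrix(x, M):
    # row m-1 holds b_m(x) = Beta(m, M - m + 1) density evaluated at x, m = 1, ..., M
    return np.array([beta(m, M - m + 1).pdf(x) for m in range(1, M + 1)])

x = np.linspace(0.0, 1.0, 201)
Bmat = bernstein_basis_matrix(x, M=10)
# g_B(x) = sum_m w_m * b_m(x) is itself a density whenever w lies in the simplex
```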

Estimation under polyhedral shape constraints

an appealing property of the Bernstein basis representation is its shape-preserving nature: a shape constraint on $m\mapsto w_m$ translates to a shape constraint on $x\mapsto g_B(x)$, restricted to its support $[0, 1]$

  • if $w_m\le w_{m+1}, m\in [M-1]$, i.e., $m \mapsto w_m$ is an increasing sequence, then $x\mapsto g_B(x)$ is increasing. Similarly, if $m\mapsto w_m$ is decreasing, then $x\mapsto g_B(x)$ is a decreasing density
  • if $2w_m\ge w_{m-1}+w_{m+1}$ for all $2\le m\le M-1$, i.e., $m\mapsto w_m$ is concave, then $x\mapsto g_B(x)$ is concave. Similarly, if $m\mapsto w_m$ is convex, then $x\mapsto g_B(x)$ is convex (see the sketch after this list for these constraints written as linear inequalities).
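
a minimal sketch of these constraints written as a polyhedral set $\{w: Aw \le 0\}$ (illustrative construction; the paper treats general convex polyhedral constraints):

```python
import numpy as np

def monotone_increasing_constraints(M):
    # rows encode w_m - w_{m+1} <= 0 for m = 1, ..., M-1
    A = np.zeros((M - 1, M))
    for m in range(M - 1):
        A[m, m], A[m, m + 1] = 1.0, -1.0
    return A

def concave_constraints(M):
    # rows encode w_{m-1} - 2 w_m + w_{m+1} <= 0 for m = 2, ..., M-1
    A = np.zeros((M - 2, M))
    for m in range(1, M - 1):
        A[m - 1, m - 1], A[m - 1, m], A[m - 1, m + 1] = 1.0, -2.0, 1.0
    return A

# such rows can be appended to the simplex MLE above, e.g. via
# {"type": "ineq", "fun": lambda w: -(A @ w)} in scipy's SLSQP constraints
```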
