
High Dimensional Linear Discriminant Analysis

Tags: Linear Discriminant Analysis, Minimax, High-Dimensional

This note is for Cai, T. T., & Zhang, L. (2019). High dimensional linear discriminant analysis: Optimality, adaptive algorithm and missing data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(4), 675–705.

Introduction

Suppose $Z$ is drawn with equal probability from one of the two Gaussian distributions $N_p(\mu_1,\Sigma)$ (class 1) and $N_p(\mu_2,\Sigma)$ (class 2). If all the parameters $\theta = (\mu_1,\mu_2,\Sigma)$ are known, Fisher’s linear discriminant rule is given by

\[C_\theta(Z) = \begin{cases} 1, & \left(Z-\frac{\mu_1+\mu_2}{2}\right)^T\Omega\delta < 0\\ 2, & \left(Z-\frac{\mu_1+\mu_2}{2}\right)^T\Omega\delta \ge 0 \end{cases}\,,\]

where $\delta = \mu_2-\mu_1$, and $\Omega = \Sigma^{-1}$.

Note that

\[\begin{align*} \log \frac{\Pr(G=2\mid Z)}{\Pr(G=1\mid Z)} &= \log \frac{\Pr(Z\mid G=2)\pi_2}{\Pr(Z\mid G=1)\pi_1}\\ &=-\frac 12\left[(z-\mu_2)^T\Sigma^{-1}(z-\mu_2)- (z-\mu_1)^T\Sigma^{-1}(z-\mu_1)\right]\\ &=-\frac 12(2z-\mu_1-\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)\\ &=\left(z-\frac{\mu_1+\mu_2}{2}\right)^T\Omega\delta\,, \end{align*}\]

where $\pi_1=\pi_2=1/2$, so the rule above assigns class 2 exactly when this log-ratio is nonnegative.

The misclassification error of this oracle rule is $R_\opt(\theta)=\Phi(-\Delta/2)$, where $\Delta = \sqrt{\delta^T\Omega\delta}$ and $\Phi$ is the standard normal cdf.

Let $Y=\delta^T\Omega Z$, then

\[\begin{align*} Y\mid G=1\sim N(\delta^T\Omega\mu_1, \delta^T\Omega\delta)\\ Y\mid G=2\sim N(\delta^T\Omega\mu_2, \delta^T\Omega\delta) \end{align*}\]

thus,

\[\begin{align*} P(Z\in R_2\mid G=1) &= P\left(Y\ge \left(\frac{\mu_1+\mu_2}{2}\right)^T\Omega\delta\mid G=1\right) \\ &=P\left(\frac{Y-\delta^T\Omega\mu_1}{\Delta}\ge \frac{\left(\frac{\mu_1+\mu_2}{2}\right)^T\Omega\delta-\mu_1^T\Omega\delta}{\Delta}\right)\\ &=P\left(N(0,1)\ge \frac{\delta^T\Omega\delta/2}{\Delta}\right)\\ &=1-\Phi(\Delta/2) = \Phi(-\Delta/2)\,. \end{align*}\]
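By symmetry, $P(Z\in R_1\mid G=2)=\Phi(-\Delta/2)$ as well, which gives $R_\opt(\theta)=\Phi(-\Delta/2)$. As a quick sanity check, here is a minimal simulation of the oracle rule (all parameter choices below are purely illustrative); the empirical misclassification rate should match $\Phi(-\Delta/2)$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative oracle parameters.
p = 5
mu1 = np.zeros(p)
mu2 = np.full(p, 0.8)
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p))  # equicorrelated, unit variances

Omega = np.linalg.inv(Sigma)
delta = mu2 - mu1
Delta = np.sqrt(delta @ Omega @ delta)           # Mahalanobis distance between the classes

def fisher_rule(Z):
    """Oracle Fisher rule: assign class 2 iff (Z - (mu1 + mu2)/2)^T Omega delta >= 0."""
    return np.where((Z - (mu1 + mu2) / 2) @ Omega @ delta >= 0, 2, 1)

# Draw labels with equal probability, then Z from the corresponding Gaussian.
n = 200_000
G = rng.integers(1, 3, size=n)
Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Z += np.where(G[:, None] == 1, mu1, mu2)

print(np.mean(fisher_rule(Z) != G))  # empirical misclassification rate
print(norm.cdf(-Delta / 2))          # theoretical optimal risk Phi(-Delta/2)
```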

However, the parameters $\mu_1,\mu_2$ and $\Sigma$ are typically unknown and need to be estimated.

  • in the low dimensional setting, plug the sample means and the pooled sample covariance matrix into Fisher’s rule.
  • in the high dimensional setting, the sample covariance matrix is not even invertible
    • regularized classification methods: regularized LDA, covariance-regularized classification, hard thresholding (all of them rely on individual sparsity assumptions on $\Omega$ (or $\Sigma$) and $\delta$)
    • a more flexible (really?) assumption is sparsity of $\beta=\Omega\delta$: some authors propose to estimate the discriminant direction $\beta$ directly instead of estimating $\Sigma$ and $\delta$ separately, under the assumption that $\beta$ is sparse. (similarity with the inverse method?)

There has been much recent progress in methodological development for high dimensional classification problems, but relatively little fundamental study of optimality theory for discriminant analysis:

a minimax study has been conducted in the special case where $\Sigma=\sigma^2I$ for some $\sigma > 0$.

Questions:

  • even in the above simple setting, there is still a gap between the minimax upper and lower bounds
  • it is unclear
    1. what the optimal rate of convergence for the minimax misclassification risk is
    2. which classification rule is rate optimal under the general Gaussian distribution

Goal of the paper:

  • answer the above questions
  • there is a paucity of methods for inference with incomplete high dimensional data, so
    1. develop optimality theory for high dimensional analysis with incomplete data
    2. construct a data-driven adaptive classifier with theoretical guarantees.

Methodology

Data-driven adaptive classifier for complete data

Linear programming discriminant (LPD) directly estimates the discriminant direction $\beta$ through solving the following optimization problem:

\[\hat\beta_{\text{LPD}}=\arg\min_\beta\{\Vert\beta\Vert_1: \text{ subject to }\Vert\hat\Sigma\beta-(\hat\mu_2-\hat\mu_1)\Vert_\infty\le \lambda_n\}\,.\]
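Since this is a linear program (an $\ell_1$ objective with an $\ell_\infty$ constraint), it can be handed to any convex solver. Below is a minimal sketch in Python using cvxpy; the estimator assembly is my own illustration, not the authors' code, and $\lambda_n$ still has to be supplied.

```python
import cvxpy as cp
import numpy as np

def lpd_direction(X1, X2, lam):
    """LPD sketch: min ||beta||_1  s.t.  ||Sigma_hat beta - (mu2_hat - mu1_hat)||_inf <= lam."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled sample covariance matrix.
    Sigma_hat = ((X1 - mu1_hat).T @ (X1 - mu1_hat)
                 + (X2 - mu2_hat).T @ (X2 - mu2_hat)) / (n1 + n2)

    beta = cp.Variable(p)
    constraints = [cp.norm(Sigma_hat @ beta - (mu2_hat - mu1_hat), "inf") <= lam]
    cp.Problem(cp.Minimize(cp.norm(beta, 1)), constraints).solve()
    return beta.value
```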

Three drawbacks:

  1. it uses a common constraint $\lambda_n$ for all co-ordinates of $a=\hat\Sigma\beta-(\hat\mu_2-\hat\mu_1)$.
  2. not adaptive in the sense that the tuning parameter $\lambda_n$ is not fully specified and needs to be chosen through an empirical method such as cross-validation.
  3. does not come with theoretical optimality guarantees.

To resolve these drawbacks, the paper introduces an adaptive algorithm for high dimensional LDA with complete data, called AdaLDA.

(…) w.p. at least $1-4/p$, \begin{align} \vert a_j\vert \le 4\sqrt{\frac{\log p}{n}}\sqrt{\sigma_{jj}}\sqrt{\frac{25\Delta^2}{2}+1}\,,\qquad j=1,\ldots,p\,.\label{eq:7} \end{align}

The AdaLDA classifier relies on an accurate estimate of the right-hand side of \eqref{eq:7}: $\sigma_{jj}$ can easily be estimated by the sample variances, but $\Delta^2$ is more difficult to estimate.

The idea is to construct a preliminary estimator $\tilde \beta$, estimate $\Delta^2$ by $\vert\tilde \beta^T(\hat\mu_2-\hat\mu_1)\vert$, and then apply the above lemma to refine the estimate of $\beta$:

  1. (estimating $\Delta^2$) construct a preliminary estimator $\tilde \beta$ and estimate $\Delta^2$ by $\hat\Delta^2=\vert\tilde \beta^T(\hat\mu_2-\hat\mu_1)\vert$;
  2. (adaptively estimating $\beta$) solve the linear program $\hat\beta_{\text{AdaLDA}}=\arg\min_\beta\Vert\beta\Vert_1$ subject to coordinate-wise constraints of the form \eqref{eq:7} with $\hat\Delta^2$ plugged in (see the sketch after this list);
  3. plug $\hat\beta_{\text{AdaLDA}}$ into Fisher’s rule.
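A hedged sketch of these three steps: the preliminary estimator below is a crude ridge plug-in of my own choosing, and the coordinate-wise constraints simply plug $\hat\Delta^2$ into the constants of \eqref{eq:7}; the paper's actual preliminary step and constants may differ.

```python
import cvxpy as cp
import numpy as np

def adaptive_lp(Sigma_hat, delta_hat, n_eff, Delta2_hat):
    """min ||beta||_1 with coordinate-wise constraints patterned after the bound (7)."""
    p = len(delta_hat)
    lam = (4 * np.sqrt(np.log(p) / n_eff) * np.sqrt(np.diag(Sigma_hat))
           * np.sqrt(25 * Delta2_hat / 2 + 1))
    beta = cp.Variable(p)
    problem = cp.Problem(cp.Minimize(cp.norm(beta, 1)),
                         [cp.abs(Sigma_hat @ beta - delta_hat) <= lam])
    problem.solve()
    return beta.value  # None if the illustrative constants make the LP infeasible

def adalda_sketch(X1, X2):
    n1, p = X1.shape
    n2 = X2.shape[0]
    n = min(n1, n2)
    mu1_hat, mu2_hat = X1.mean(axis=0), X2.mean(axis=0)
    delta_hat = mu2_hat - mu1_hat
    Sigma_hat = ((X1 - mu1_hat).T @ (X1 - mu1_hat)
                 + (X2 - mu2_hat).T @ (X2 - mu2_hat)) / (n1 + n2)

    # Step 1: crude preliminary estimator of beta (ridge plug-in, my own choice),
    # then estimate Delta^2 by |tilde_beta^T (mu2_hat - mu1_hat)|.
    beta_tilde = np.linalg.solve(Sigma_hat + np.sqrt(np.log(p) / n) * np.eye(p), delta_hat)
    Delta2_hat = abs(beta_tilde @ delta_hat)

    # Step 2: adaptively estimate beta via the linear program above.
    beta_hat = adaptive_lp(Sigma_hat, delta_hat, n, Delta2_hat)

    # Step 3: plug beta_hat into Fisher's rule.
    def classify(Z):
        return np.where((Z - (mu1_hat + mu2_hat) / 2) @ beta_hat >= 0, 2, 1)
    return beta_hat, classify
```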

A concern related to the implementation of the algorithm (resolved):

Given a linear program

\[\min c^Tx,\quad\text{subject to } Ax\le b\,,\]

we can convert the inequality constraints to equalities by introducing a vector of slack variables $z$ and writing

\[\min c^Tx,\quad \text{subject to }Ax+z=b,z\ge 0\,.\]

This form is still not quite standard, since not all the variables are constrained to be nonnegative. We deal with this by splitting $x$ into its nonnegative and nonpositive parts, $x=x^+-x^-$, then we have

\[\min\;\begin{bmatrix}c\\-c\\0\end{bmatrix}^T\begin{bmatrix}x^+\\x^-\\z\end{bmatrix}\,,\quad \text{s.t. }\begin{bmatrix}A &-A & I\end{bmatrix}\begin{bmatrix}x^+\\x^-\\z\end{bmatrix}=b,\quad \begin{bmatrix} x^+\\x^-\\z \end{bmatrix}\ge 0\,.\]

Here, the implementation of the algorithm also splits the variable in this way. At first, I wondered whether this is proper; although the feasible space is enlarged, the optimal solutions still agree with those of the original problem. For the $\ell_1$ objective, the split problem minimizes $\sum_j(\beta_j^++\beta_j^-)$, so at any optimum $\min(\beta_j^+,\beta_j^-)=0$ and the objective equals $\Vert\beta\Vert_1$.

Take a toy example: a coordinate $\beta_j=4$ can be written as $4-0$ or $5-1$; both decompositions satisfy the constraints, but the objective contributions compare as $4+0<5+1$, so the second decomposition is discarded.
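A quick numerical illustration of this point, with toy matrices standing in for $\hat\Sigma$ and $\hat\mu_2-\hat\mu_1$ (solved with scipy.optimize.linprog; not the paper's implementation): the split LP is solved with nonnegativity bounds, and at the optimum no coordinate keeps both a positive and a negative part.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
p = 10
S = np.eye(p) + 0.1 * np.ones((p, p))   # toy stand-in for Sigma_hat
d = rng.normal(size=p)                  # toy stand-in for mu2_hat - mu1_hat
lam = 0.2

# Split beta = beta_plus - beta_minus, both nonnegative; the l1 objective becomes
# 1^T beta_plus + 1^T beta_minus, and ||S beta - d||_inf <= lam becomes two sets of
# linear inequalities.
c = np.ones(2 * p)
A_ub = np.vstack([np.hstack([S, -S]),    #  S(beta+ - beta-) - d <= lam
                  np.hstack([-S, S])])   # -(S(beta+ - beta-) - d) <= lam
b_ub = np.concatenate([d + lam, lam - d])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p))

beta_plus, beta_minus = res.x[:p], res.x[p:]
beta_hat = beta_plus - beta_minus
print(np.minimum(beta_plus, beta_minus).max())        # ~0: no coordinate keeps both parts
print(np.abs(S @ beta_hat - d).max() <= lam + 1e-8)   # the original constraint holds
```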

ADAM (Adaptive LDA with randomly Missing data)

It is unrelated to the optimizer Adam used in neural networks.

ADAM first estimates the generalized sample means and the generalized sample covariance matrix from the incomplete data.
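My reading is that these generalized estimates use only the observed entries: each coordinate mean is averaged over the samples observing it, and each covariance entry $(j,k)$ is averaged over the samples observing both coordinates, whose minimum count plays the role of $n_\min^*$. A sketch under that reading (the exact centering and scaling conventions in the paper may differ):

```python
import numpy as np

def generalized_estimates(X):
    """Available-case estimates from data with missing entries encoded as NaN.
    This is my reading of the 'generalized' sample mean/covariance; the paper's
    exact definitions may differ."""
    obs = ~np.isnan(X)                                   # missingness pattern S
    counts = obs.sum(axis=0)                             # observations per coordinate
    mu_hat = np.nansum(X, axis=0) / counts               # generalized sample mean

    # Generalized sample covariance: entry (j, k) averages over samples observing both.
    Xc = np.where(obs, X - mu_hat, 0.0)
    pair_counts = obs.T.astype(float) @ obs.astype(float)  # n_{jk}: both j and k observed
    Sigma_hat = (Xc.T @ Xc) / pair_counts

    n_min_star = int(pair_counts.min())                  # plays the role of n_min^*
    return mu_hat, Sigma_hat, n_min_star
```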

(...) with high probability \begin{align} \vert a_j^*\vert \le 4\sqrt{\frac{\log p}{n_\min^*}}\sqrt{\sigma_{jj}}\sqrt{64\Delta^2+1}\,,\qquad j=1,\ldots,p\,. \end{align}

ADAM:

  1. estimate $\beta$ by a preliminary estimator, and then estimate $\Delta^2$;
  2. adaptively estimate $\beta$;
  3. plug into Fisher’s rule (a sketch combining these steps with the generalized estimates follows the list).
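Combining the previous sketches (reusing `generalized_estimates` and `adaptive_lp` from above), a rough end-to-end version of ADAM; note that the missing-data lemma above carries the constant $\sqrt{64\Delta^2+1}$ rather than $\sqrt{25\Delta^2/2+1}$, which this sketch ignores for brevity.

```python
import numpy as np

def adam_sketch(X1, X2):
    """Rough sketch only: generalized estimates + the adaptive LP from the AdaLDA sketch,
    with the sample size replaced by n_min^*."""
    mu1_hat, Sigma1_hat, n1_star = generalized_estimates(X1)
    mu2_hat, Sigma2_hat, n2_star = generalized_estimates(X2)
    delta_hat = mu2_hat - mu1_hat
    Sigma_hat = (Sigma1_hat + Sigma2_hat) / 2     # crude pooling, for illustration only
    n_star = min(n1_star, n2_star)
    p = len(delta_hat)

    # Step 1: crude preliminary estimator of beta and of Delta^2.
    beta_tilde = np.linalg.solve(Sigma_hat + np.sqrt(np.log(p) / n_star) * np.eye(p), delta_hat)
    Delta2_hat = abs(beta_tilde @ delta_hat)

    # Steps 2-3: adaptively estimate beta, then plug into Fisher's rule.
    beta_hat = adaptive_lp(Sigma_hat, delta_hat, n_star, Delta2_hat)
    classify = lambda Z: np.where((Z - (mu1_hat + mu2_hat) / 2) @ beta_hat >= 0, 2, 1)
    return beta_hat, classify
```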

Extensions to other missingness mechanisms, such as missingness not at random, are possible but challenging. The paper claims that the consistency of their algorithm relies only on consistent estimation of the mean vectors and the covariance matrix; thus, if the means and the covariance matrix can be estimated consistently under some missingness-not-at-random model, a consistent classification rule can be constructed from these estimators. (extensions?)

Theoretical properties

Develop an optimality theory for high dimensional LDA for both the complete data and the incomplete data settings.

Theoretical analysis of AdaLDA

Consider the parameter space

\[\cG(s, M_{n,p}) = \{\theta=(\mu_1,\mu_2,\Sigma):\mu_1,\mu_2\in\bbR^p, \Sigma\in\bbR^{p\times p},\Sigma\succ 0, \Vert\beta\Vert_0\le s, M^{-1}\le \lambda_\min(\Sigma)\le\lambda_\max(\Sigma)\le M, M_{n,p}\le \Delta\le 3M_{n,p}\}\,.\]
(...) $$ \sup_{\theta\in\cG(s,M_{n,p})}\bbE[\Vert\hat\beta_{AdaLDA}-\beta\Vert_2]\preceq \sqrt{\frac{s\log p}{n}} $$

Then characterize the accuracy of the classification rule $\hat C_{AdaLDA}$, measured by the excess misclassification risk $R_\theta(\hat C)-R_\opt(\theta)$.

(...) $$ \inf_{\theta\in\cG(s,M_{n,p})}\bbP\left(R_\theta(\hat C_{AdaLDA})-R_\opt(\theta)\le C\frac{s\log p}{n}\right)\ge 1-8p^{-1} $$

Theoretical analysis of ADAM

Under the MCR model, suppose the missingness pattern $S\in\{0,1\}^{n_1\times p}\times \{0,1\}^{n_2\times p}$ is a realization of a distribution $\cF$. Consider the distribution space

\[\Psi(n_0;n,p)=\{\cF:\bbP_{S\sim \cF}\{c_1n_0\le n_\min^*(S)\le c_2n_0\}\ge 1-p^{-1}\}\,.\]

Two theorems talk about $\bbE\Vert \hat\beta_{ADAM}-\beta\Vert_2$ and $R_\theta(\hat C_{ADAM})-R_\opt(\theta)$, respectively.

Minimax lower bounds

Only the results for the missing-data setting are stated (the complete-data setting is a special case).

The lower bound results show that the rates of convergence obtained by the AdaLDA and ADAM algorithms are indeed optimal, both for estimation of the discriminant direction $\beta$ and for classification.

(...) Under the MCR model, the minimax risk of estimating the discriminant direction $\beta$ over the classes $\cG(s,M_{n,p})$ and $\Psi(n_0;n,p)$ satisfies $$ \inf_{\hat\beta}\sup_{\theta\in\cG(s,M_{n,p}),\,\cF\in\Psi(n_0;n,p)} \bbE[\Vert\hat\beta-\beta\Vert_2]\gtrsim M_{n,p}\sqrt{\frac{s\log p}{n_0}}\,. $$
(...) $$ \inf_{\hat C}\sup_{\theta\in \cG(s,M_{n,p}),\cF\in\Psi(n_0;n,p)}\left[\bbP\left(R_\theta(\hat C)-R_\opt(\theta)\ge C_\alpha\frac{s\log p}{n_0}\right)\right]\ge 1-\alpha\,. $$

Reduce the loss $R_\theta(\hat C)-R_\opt(\theta)$ to the risk function $L_\theta(\hat C)$.

$$ R_\theta(\hat C)-R_\opt(\theta) \ge \frac{\sqrt{2\pi}\Delta}{8}\exp\left(\frac{\Delta^2}{8}\right)L_\theta^2(\hat C)\,.$$
