WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Sparse LDA

Posted on (Update: )
Tags: Linear Discriminant Analysis, High-Dimensional

This note is based on Shao, J., Wang, Y., Deng, X., & Wang, S. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of Statistics, 39(2), 1241–1265.

LDA works well for fixed-$p$-large-$n$ situations and is asymptotically optimal in the sense that, when $n$ increases to infinity, its misclassification rate over that of the optimal rule converges to one.

The paper shows that, the LDA is still asymptotically optimal when $p$ diverges to infinity at a rate slower than $\sqrt n$.

Bickel and Levina (2004) showed that the LDA is asymptotically as bad as random guessing when $p > n$.

Main purpose of the paper: construct a sparse LDA and show it is asymptotically optimal under some sparsity conditions on unknown parameters and some condition on the divergence rate of $p$.

Optimal rule and LDA

Let $\x\sim N_p(\mu_k,\Sigma), k=1,2$, where $\mu_1\neq \mu_2$.

\[R_{\text{OPT}}=\Phi(-\Delta_p/2)\,,\qquad \Delta_p = \sqrt{\delta'\Sigma^{-1}\delta}\,,\]

where $\delta’=\mu_1-\mu_2$.

Assume the following regularity conditions: there is a constant $c_0$ such that

\[c_0^{-1}\le \text{all eigenvalues of }\Sigma\le c_0\]


\[c_0^{-1}\le \max_{j\le p}\delta_j^2\le c_0\,,\]

why this condition

Under the above two conditions, $\Delta_p\ge c_0^{-1}$, and hence $R_{\text{OPT}}\le \Phi(-(2c_0)^{-1}) < 1/2$. Also, $\Delta_p^2 = O(\Vert \delta\Vert^2)$ and $\Vert \delta\Vert^2 = O(\Delta_p^2)$ so that the rate of $\Vert\delta\Vert^2\rightarrow \infty$ is the same as the rate of $\Delta_p^2\rightarrow \infty$.

Training sample $\X=\{\x_{ki},i=1,\ldots,n_k,k=1,2\}$. For a classification rule $T$ constructed using the training sample, its performance can be assessed by the conditional misclassification rate $R_T(\X)$ defined as the average of the conditional probabilities of making two types of misclassification, where the conditional probabilities are with respect to $\x$, given the training sample $\X$.

The asymptotic performance of $T$ refers to the limiting behavior of $R_T(\X)$ or $R_T$ as $n\rightarrow \infty$. Since $0\le R_T(\X)\le 1$, by the dominated convergence theorem, if $R_T(\X)\rightarrow_p c$, then $R_T\rightarrow_p c$. Hence, the paper focus on the limiting behavior of the conditional misclassification rate $R_T(\X)$.

Hope to find a rule $T$ such that $R_T(\X)$ converges in probability to the same limit as $R_{\text{OPT}}$. If $R_{\text{OPT}}\rightarrow 0$, we hope not only $R_T(\X)\rightarrow_p 0$, but also $R_T(\X)$ and $R_{\text{OPT}}$ have the same convergence rate.


  • $T$ is asymptotically optimal if $R_T(\X)/R_{\text{OPT}} \rightarrow_p 1$.
  • $T$ is asymptotically sub-optimal if $R_T(\X)/R_{\text{OPT}}\rightarrow_p 0$.
  • $T$ is asymptotically worse if $R_T(\X)\rightarrow 1/2$.

Sparse LDA

Focus on the situation where the limit of $p/n$ is positive or $\infty$.

A sparsity measure on $\Sigma$:

\[C_{h,p} = \max_{j\le p}\sum_{l=1}^p\vert \sigma_{jl}\vert^h\,.\]

In the special case of $h=0$, $C_{0,p}$ is the maximum of the numbers of nonzero elements of rows of $\Sigma$ so that a $C_{0,p}$ much smaller than $p$ implies many elements of $\Sigma$ are equal to zero. If $C_{h,p}$ is much smaller than $p$ for a constant $h\in(0, 1)$, then $\Sigma$ is sparse in the sense that many elements of $\Sigma$ are very small. Threshold the off-diagonal elements at $t_n=M_1\sqrt{\log p}/\sqrt{n}$.

Consider the following measure on $\delta$ that is similar to the sparsity measure $C_{h,p}$ on $\Sigma$:

\[D_{g,p} = \sum_{j=1}^p\delta_j^{2g}\,.\]

Consider the sparse estimator $\tilde \delta$ that is $\hat\delta$ thresholded at

\[a_n = M_2\left(\frac{\log p}{n}\right)^\alpha\,.\]


The dataset of human acute leukemia data has been seen many times, such as in Fan and Fan (2008)’s FAIR and Section 18.4 of ESL. The original paper would be Golub et al. (1999), which classify acute myeloid leukemia (AML, 急性粒细胞白血病) and acute lymphoblastic leukemia (ALL, 急性淋巴细胞白血病). The dataset contains the expression levels of $p=7129$ genes for $n=72$ patients, in which $n_1=47$ are from the ALL group, and $n_2=25$ are from the AML group. Actually, the original dataset has been divided into training set ($n=38$) and test set ($n=34$), but it seems that the paper had not used such information.

Reproduce some results for human acute leukemia data in the paper. For simplicity, I had not written the cross-validation part, just directly take the same $M_1$ and $M_2$ for the train data, and the misclassification error is 0.0294 in the test data.

Particular $M_1$ and $M_2$

Set $M_1=10^7$ and $M_2=300$, which exactly equal to the one minimized the leave-one-out cross-validation score in the paper, although there is a slightly difference, herein I used the training set, while the paper does not separate the training set and testing set.

function thresholding(δ::Array{Float64, 1}, S::Array{Float64, 2}, n::Int; M1 = 1e7, M2 = 300)
    p = length(δ)
    tn = M1 * (log(p) / n)^0.5
    an = M2 * (log(p) / n)^0.3
    δ_s = copy(δ)
    for i = 1:p
        if δ[i] < an
            δ_s[i] = 0
    S_s = copy(S)
    for i = 1:p
        for j = 1:i-1
            if S[i, j] < tn
                S_s[i, j] = 0
                S_s[j, i] = 0
    return δ_s, S_s

function classify(x::Array{Float64, 2}, δ::Array{Float64, 1}, Σ::Array{Float64, 2}, μ::Array{Float64, 1})
    n = size(x, 1)
    cl = zeros(Int, n)
    for i = 1:n
        if δ' * inv(Σ) * (x[i,:] - μ) > 0
            cl[i] = 1
            cl[i] = 2
    return cl

function SLDA(X::Array{Float64, 2}, y::Array{Int, 1}, X_test::Array{Float64, 2}, y_test::Array{Int, 1})
    n, p = size(X)
    X1 = X[y .== 1, :]
    X2 = X[y .== 2, :]
    n1 = sum(y .== 1)
    n2 = sum(y .== 2)
    if n1 + n2 != n
        error("Dimensions should match.")
    # calculate the mean
    μ1 = mean(X1, dims = 1)
    μ2 = mean(X2, dims = 1)
    μ = (μ1[:] + μ2[:]) / 2
    δ = μ1[:] - μ2[:]
    # calculate the covariance
    S1 = cov(X1)
    S2 = cov(X2)
    S = (n1 * S1 + n2 * S2) / n
    cl = classify(X_test, δ, S, μ)
    return sum(cl != y_test) / length(cl)

SLDA(train_X, train_y, test_X, test_y)

Refer to SLDA/script.jl for more detailed code.

Published in categories Memo