
Features Annealed Independence Rules

Tags: Linear Discriminant Analysis, High-Dimensional

This note is based on Fan, J., & Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36(6), 2605–2637.

The difficulty of high-dimensional classification is intrinsically caused by the existence of many noise features that do not contribute to the reduction of misclassification rate.

The paper argues that feature selection is necessary for high-dimensional classification. When the independence rule is applied to a selected subset of features, the resulting Features Annealed Independence Rule (FAIR) overcomes both the interpretability issue and noise accumulation.

Consider the independence classification rule, which classifies a new feature vector $\x$ into class 1 if

\[\delta(\x) = (\x-\mu)'\D^{-1}(\mu_1-\mu_2) > 0\,,\]

where $\mu = (\mu_1+\mu_2) / 2$ and $\D = \diag(\Sigma)$.

The sample version is

\[\hat\delta(\x) = (\x-\hat\mu)'\hat\D^{-1}(\hat\mu_1-\hat\mu_2)\,,\]

where

\[\hat\mu_k = \sum_{i=1}^{n_k}Y_{ki}/n_k\,,\qquad k=1,2,\qquad \hat\mu=(\hat\mu_1+\hat\mu_2)/2\]

and

\[\hat\D = \diag\{(S_{1j}^2+S_{2j}^2)/2,j=1,\ldots,p\}\,,\]

where $S_{kj}^2$ is the sample variance of the $j$-th feature in class $k$.
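
Here is a minimal numpy sketch of the sample rule above; the function name `independence_rule` and the two-matrix interface are my own choices for illustration.

```python
import numpy as np

def independence_rule(Y1, Y2):
    """Fit the sample independence rule from two training matrices.

    Y1, Y2: arrays of shape (n_k, p) whose rows are observations.
    Returns a function that assigns a new feature vector x to class 1 or 2.
    """
    mu1_hat, mu2_hat = Y1.mean(axis=0), Y2.mean(axis=0)
    mu_hat = (mu1_hat + mu2_hat) / 2
    # Diagonal variance estimate: the average of the two within-class
    # sample variances, matching \hat D above.
    d_hat = (Y1.var(axis=0, ddof=1) + Y2.var(axis=0, ddof=1)) / 2

    def classify(x):
        # delta = (x - mu_hat)' D^{-1} (mu1_hat - mu2_hat)
        delta = np.sum((x - mu_hat) * (mu1_hat - mu2_hat) / d_hat)
        return 1 if delta > 0 else 2

    return classify
```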

Once the same covariance matrix is assumed across classes, why not use the pooled variance estimator instead?

To extract salient features, the authors appeal to two-sample $t$-test statistics. The two-sample $t$-statistic for feature $j$ is defined as

\[T_j = \frac{\bar Y_{1j}-\bar Y_{2j}}{\sqrt{S_{1j}^2/n_1 + S_{2j}^2/n_2}}\,,\qquad j=1,\ldots,p\,,\]

then FAIR takes the following form:

\[\hat\delta_{\text{FAIR}}(\x) = \sum_{j=1}^p\hat\alpha_j(x_j-\hat\mu_j)/\hat\sigma_j^2\,1_{\{\sqrt{n/(n_1n_2)}\vert T_j\vert > b\}}\,,\]

where $n = n_1 + n_2$, $\hat\alpha_j = \hat\mu_{1j}-\hat\mu_{2j}$, $\hat\mu_j$ and $\hat\sigma_j^2$ are the $j$-th components of $\hat\mu$ and $\diag(\hat\D)$, and $b$ is a threshold that determines how many features survive the selection.

Equivalently, FAIR first sorts the features by the absolute values of their $t$-statistics in descending order and then uses only the top $m$ features to classify the data.
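
Below is a sketch of FAIR in this equivalent top-$m$ form; the name `fair_classifier` and the choice of passing $m$ directly (rather than the threshold $b$, whose data-driven selection the paper also discusses) are my own.

```python
import numpy as np

def fair_classifier(Y1, Y2, m):
    """FAIR sketch: keep the m features with the largest |t| and
    apply the independence rule to those features only."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    mu1_hat, mu2_hat = Y1.mean(axis=0), Y2.mean(axis=0)
    s1, s2 = Y1.var(axis=0, ddof=1), Y2.var(axis=0, ddof=1)
    # Two-sample t-statistic for each feature.
    t = (mu1_hat - mu2_hat) / np.sqrt(s1 / n1 + s2 / n2)
    keep = np.argsort(-np.abs(t))[:m]  # indices of the top-m features

    alpha_hat = mu1_hat - mu2_hat
    mu_hat = (mu1_hat + mu2_hat) / 2
    sigma2_hat = (s1 + s2) / 2

    def classify(x):
        delta = np.sum(alpha_hat[keep] * (x[keep] - mu_hat[keep])
                       / sigma2_hat[keep])
        return 1 if delta > 0 else 2

    return classify
```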

I reproduce the simulation below, but I obtain smaller misclassification rates, perhaps because of different parameters, even though I followed the setting described in the paper.
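
The following is a minimal simulation sketch using the `fair_classifier` above; the dimension, sample sizes, and the choice of 10 signal features with unit mean shift are illustrative placeholders, not necessarily the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n1, n2, n_test = 1000, 30, 30, 200

# Two Gaussian classes with identity covariance; only the first
# 10 coordinates of the mean differ (illustrative choice).
mu1, mu2 = np.zeros(p), np.zeros(p)
mu2[:10] = 1.0

def draw(n, mu):
    return rng.standard_normal((n, p)) + mu

Y1, Y2 = draw(n1, mu1), draw(n2, mu2)
X_test = np.vstack([draw(n_test, mu1), draw(n_test, mu2)])
labels = np.r_[np.ones(n_test), 2 * np.ones(n_test)]

for m in (10, 100, p):  # m = p recovers the plain independence rule
    clf = fair_classifier(Y1, Y2, m)
    pred = np.array([clf(x) for x in X_test])
    print(f"m = {m}: misclassification rate = {np.mean(pred != labels):.3f}")
```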

