# Optimal estimation of functionals of high-dimensional mean and covariance matrix

## Abstract

• Estimate the functional $\mu'\Sigma^{-1}\mu$ involving both the mean vector $\mu$ and covariance matrix $\Sigma$.
• Study the minimax estimation of the functional in the high-dimensional setting where $\Sigma^{-1}\mu$ is sparse.

## Introduction

Many new theories and methods have been proposed to address the challenges arising from high dimensionality.

• Various regularization techniques have been proposed to estimate large covariance matrices under different structural assumptions on the matrix, such as sparsity, conditional sparsity, and smoothness.
• Two review papers: TODO

Let $\x_i\in\IR^p$ be i.i.d. random vectors with $\bbE(\x_i)=\mu$ and $\Cov(\x_i)=\Sigma$.

Primary goal: estimate the functional $\mu'\Sigma^{-1}\mu$ based on the observations $\{\x_i\}_{i=1}^n$, under the assumption that $\Sigma^{-1}\mu$ is (approximately) sparse.

Another goal: establish the optimal rate of convergence for estimating the functional $\mu'\Sigma^{-1}\mu$, i.e., reveal the minimax estimation rate of $\mu'\Sigma^{-1}\mu$ in this high-dimensional multivariate problem.

## Preliminaries and Examples

Suppose that $\x_1,\ldots,\x_n$ are independent copies of $\x \in\IR^p$ with $\bbE(\x)=\mu$ and $\Cov(\x)=\Sigma$. Throughout the paper, assume $\x$ is sub-gaussian. That is, $\x = \Sigma^{1/2}y+\mu$ and the zero-mean isotropic random vector $y$ satisfies

$\bbP(\vert c^Ty\vert \ge t)\le 2\exp(-t^2/\nu^2)\,,\quad \text{for all } t\ge 0, \Vert c\Vert_2=1\,,$

with $\nu > 0$ being a constant.

We study the estimation problem under the minimax framework. The central goal is to characterize the minimax rate of the estimation error, given by

$\inf_{\hat \theta}\sup_{(\mu,\Sigma)\in\cH} \bbE\vert \hat\theta-\mu'\Sigma^{-1}\mu\vert\,.$
Consider $\cH=\{(\mu,\Sigma):\mu'\Sigma\mu\le c\}$, where $c > 0$ is a fixed constant. If $p \ge n^2$, it holds that $$\inf_{\hat\theta}\sup_{(\mu,\Sigma)\in \cH}\bbE\vert\hat\theta-\mu'\Sigma^{-1}\mu\vert \ge \tilde c\,,$$ where $\tilde c>0$ is a constant that depends on $c$.

This shows that it is impossible to consistently estimate the functional $\mu'\Sigma^{-1}\mu$ under the scaling $p\ge n^2$, a regime that is not uncommon in high-dimensional problems. To overcome this difficulty, we need a more structured parameter space.

However, the authors admit that it is not clear what kinds of simple constraints would make the problem solvable and practical. As part of their contribution, they consider the following parameter space with a sparsity constraint

$\cH(s,\tau) = \left\{ (\mu, \Sigma): \Vert \Sigma^{-1}\mu\Vert_0 \le s, \mu^T\Sigma^{-1}\mu\le \tau, c_L\le \lambda_\min(\Sigma)\le \delta_\Sigma\le c_U \right\}$

or more generally the approximate sparsity constraint

$\cH_q(R,\tau) = \left\{ (\mu, \Sigma): \Vert \Sigma^{-1}\mu\Vert_q^q\le R, \mu^T\Sigma^{-1}\mu\le \tau, c_L\le \lambda_\min(\Sigma)\le \delta_\Sigma \le c_U \right\}, q\in (0, 1]\,,$

where $\delta_\Sigma=\max_{i\in[p]}\sigma_{ii}$, and $s$, $R$, and $\tau$ may scale with $n$ and $p$.

For notational simplicity, we suppress the dependence of $\cH(s,\tau)$ and $\cH_q(R,\tau)$ on $c_L$ and $c_U$.
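To make the definitions concrete, here is a small sketch (the function name and the chosen constants are hypothetical) that checks whether a pair $(\mu,\Sigma)$ lies in $\cH(s,\tau)$ for given $c_L, c_U$:

```python
import numpy as np

def in_H(mu, Sigma, s, tau, c_L, c_U, tol=1e-10):
    """Check membership in H(s, tau): ||Sigma^{-1} mu||_0 <= s,
    mu' Sigma^{-1} mu <= tau, lambda_min(Sigma) >= c_L, max_i sigma_ii <= c_U."""
    alpha = np.linalg.solve(Sigma, mu)              # Sigma^{-1} mu
    sparsity_ok = np.sum(np.abs(alpha) > tol) <= s  # ||alpha||_0 <= s
    functional_ok = mu @ alpha <= tau               # mu' Sigma^{-1} mu <= tau
    eig_ok = np.linalg.eigvalsh(Sigma)[0] >= c_L    # smallest eigenvalue bounded below
    var_ok = np.max(np.diag(Sigma)) <= c_U          # delta_Sigma bounded above
    return sparsity_ok and functional_ok and eig_ok and var_ok
```

For instance, with $\Sigma=I_4$ and $\mu=(1,0,2,0)^T$ we have $\Sigma^{-1}\mu=\mu$, so the pair lies in $\cH(2,5)$ but not in $\cH(1,5)$.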

### The mean-variance portfolio optimization

Markowitz's theory can be expressed as an optimization problem whose solution determines the proportion of each asset in a portfolio by maximizing the expected return under a risk constraint, where risk is measured by the variance of the portfolio.

Specifically, let $\x \in \IR^p$ be the excess returns of $p$ risky assets, with $\E\x = \mu$ and $\cov(\x) = \Sigma$. Markowitz's portfolio optimization problem is

$\max_\w \E(\w^T\x) = \w^T\mu \text{ s.t. }\var(\w^T\x) = \w^T\Sigma\w\le \sigma^2\,,$

where $\sigma$ is the prescribed risk level. The optimal portfolio $\w^*$ is

$\w^* = \frac{\sigma}{\sqrt{\mu^T\Sigma^{-1}\mu}}\Sigma^{-1}\mu\,.$

We need both $\mu^T\Sigma^{-1}\mu$ and $\Sigma^{-1}\mu$ in order to construct the optimal portfolio allocation $\w^*$, and the number of assets $p$ can be large compared to the sample size $n$.
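As a numerical sanity check of the closed form for $\w^*$ (all inputs below are made up), the risk constraint is met with equality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: p = 5 assets with known mean and covariance.
p = 5
mu = rng.normal(size=p) * 0.05
A = rng.normal(size=(p, p))
Sigma = A @ A.T / p + 0.1 * np.eye(p)   # positive definite covariance

sigma = 0.2                              # prescribed risk level

# w* = sigma / sqrt(mu' Sigma^{-1} mu) * Sigma^{-1} mu
alpha = np.linalg.solve(Sigma, mu)       # Sigma^{-1} mu
theta = mu @ alpha                       # mu' Sigma^{-1} mu
w_star = sigma / np.sqrt(theta) * alpha

# The constraint is active: w*' Sigma w* = sigma^2 / theta * mu' Sigma^{-1} mu = sigma^2.
assert np.isclose(w_star @ Sigma @ w_star, sigma**2)
```

This also makes the point in the text concrete: both the functional $\mu^T\Sigma^{-1}\mu$ (`theta`) and the vector $\Sigma^{-1}\mu$ (`alpha`) enter the allocation.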

### High-dimensional LDA

The classical LDA procedure approximates Fisher’s rule

$\text{classify } \x \text{ to class 1 if } (\mu_1-\mu_2)^T\Sigma^{-1}\left(\x - \frac{\mu_1+\mu_2}{2}\right)\ge 0$

by replacing the unknown parameters $\mu_1,\mu_2,\Sigma$ by their sample versions.

However, in high-dimensional settings, standard LDA can be no better than random guessing. Various high-dimensional LDA approaches have been proposed under sparsity assumptions on $\mu_1-\mu_2$ or $\Sigma$. An alternative approach to sparse linear discriminant analysis imposes sparsity directly on $\Sigma^{-1}(\mu_1-\mu_2)$, based on the key observation that Fisher's rule depends on $\mu_1-\mu_2$ and $\Sigma$ only through the product $\Sigma^{-1}(\mu_1-\mu_2)$.
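A short sketch (function name hypothetical) makes explicit that the rule only needs the product $\Sigma^{-1}(\mu_1-\mu_2)$:

```python
import numpy as np

def fisher_rule(x, mu1, mu2, Sigma):
    """Classify x to class 1 iff (mu1 - mu2)' Sigma^{-1} (x - (mu1 + mu2)/2) >= 0.

    Only Sigma^{-1} (mu1 - mu2) is needed, which is why sparsity can be
    imposed directly on this vector in high dimensions.
    """
    direction = np.linalg.solve(Sigma, mu1 - mu2)   # Sigma^{-1} (mu1 - mu2)
    return 1 if direction @ (x - (mu1 + mu2) / 2) >= 0 else 2
```

Evaluating the rule at the class means returns the corresponding labels, since $(\mu_1-\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)\ge 0$ for positive definite $\Sigma$.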

Although the functional estimation in the LDA problem looks different from the problem considered above, it is possible to extend the results to the LDA setting by a simple adaptation.

Let $\{\x_i\}_{i=1}^{n_1}$ and $\{\y_i\}_{i=1}^{n_2}$ be two sets of i.i.d. random samples from $N(\mu_1,\Sigma)$ and $N(\mu_2,\Sigma)$, respectively. Define the parameter space

$\cI(s,\tau) = \left\{ (\mu_1,\mu_2,\Sigma): \Vert \Sigma^{-1}(\mu_1-\mu_2)\Vert_0\le s,(\mu_1-\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)\le \tau, c_L\le \lambda_\min(\Sigma)\le \delta_\Sigma\le c_U \right\}\,.$

We lower bound the minimax error in the two-sample problem by the error from a one-sample problem:

\begin{align} &\inf_{\hat \theta}\sup_{(\mu_1,\mu_2,\Sigma)\in \cI(s,\tau)}\E\vert \hat \theta_{\{\x_i\}_{i=1}^{n_1},\{\y_i\}_{i=1}^{n_2}} - (\mu_1-\mu_2)^T\Sigma^{-1}(\mu_1-\mu_2)\vert\notag\\ \ge{}& \inf_{\hat \theta}\sup_{(\mu_1,\Sigma)\in \cH(s,\tau)}\E\vert \hat \theta_{\{\x_i\}_{i=1}^{n_1}, \{\y_i\}_{i=1}^{n_2}}-\mu_1^T\Sigma^{-1}\mu_1\vert\tag{set $\mu_2=0$}\\ \ge{}& \inf_{\hat \theta}\sup_{(\mu_1,\Sigma)\in \cH(s, \tau)}\E\vert \hat \theta_{\{\x_i\}_{i=1}^{n_1+2n_2}} - \mu_1^T\Sigma^{-1}\mu_1\vert\,.\tag{replace $\y_i$ by $(\x_{n_1+2i-1}-\x_{n_1+2i})/\sqrt 2$} \end{align}
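The last step of the reduction works because the transformed pairs $(\x_{n_1+2i-1}-\x_{n_1+2i})/\sqrt 2$ are distributed as $N(0,\Sigma)$, i.e., a valid sample of $\y_i$'s when $\mu_2=0$: the means cancel and the covariance is $(\Sigma+\Sigma)/2=\Sigma$. A quick Monte Carlo check (with made-up $\mu_1$ and $\Sigma$):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 200_000
mu1 = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
L = np.linalg.cholesky(Sigma)

# Draw 2n samples from N(mu1, Sigma) and form z_i = (x_{2i-1} - x_{2i}) / sqrt(2).
x = mu1 + rng.normal(size=(2 * n, p)) @ L.T
z = (x[0::2] - x[1::2]) / np.sqrt(2)

# z_i ~ N(0, Sigma): sample mean close to 0, sample covariance close to Sigma.
print(np.abs(z.mean(axis=0)).max())        # small
print(np.abs(np.cov(z.T) - Sigma).max())   # small
```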

De-biasing ideas for statistical inference in high-dimensional linear models: TODO.

## Minimax estimation of the functional

Let $\alpha = \Sigma^{-1}\mu$ and $\theta = \mu^T\Sigma^{-1}\mu$.

### Optimal estimation over exact sparsity classes

Note that

$\alpha = \argmin_{\beta \in \IR^p} \frac 12 \beta^T\Sigma\beta - \beta^T\mu$

and that $\alpha$ is sparse. A natural estimator for the vector $\alpha$ is then the $\ell_1$-regularized M-estimator:

$$\tilde \alpha = \argmin_{\Vert \beta\Vert_2\le \gamma}\frac 12 \beta^T\hat \Sigma\beta - \beta^T\hat\mu +\lambda \Vert \beta\Vert_1\,.\label{3.1}$$

A small point of confusion: the paper writes $\in$ instead of $=$ here. Is this just a typo, or does it emphasize that the minimizer may not be unique?
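The paper does not spell out an algorithm for \eqref{3.1}; a projected proximal-gradient (ISTA-style) sketch is one natural way to compute $\tilde\alpha$ (function names, step size, and iteration count below are my own ad hoc choices; the $\ell_2$-ball constraint is enforced by projecting after the soft-thresholding step):

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_m_estimator(Sigma_hat, mu_hat, lam, gamma, step=None, n_iter=3000):
    """Sketch of min_{||beta||_2 <= gamma} 0.5 beta' Sigma_hat beta - beta' mu_hat
    + lam ||beta||_1, via proximal gradient with an l2-ball projection."""
    p = len(mu_hat)
    if step is None:
        step = 1.0 / np.linalg.eigvalsh(Sigma_hat)[-1]  # 1/L, L = largest eigenvalue
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = Sigma_hat @ beta - mu_hat                # gradient of the smooth part
        beta = soft_threshold(beta - step * grad, step * lam)
        nrm = np.linalg.norm(beta)
        if nrm > gamma:                                 # project onto {||beta||_2 <= gamma}
            beta *= gamma / nrm
    return beta
```

A useful sanity check: with $\lambda=0$ and $\gamma$ large, the iterates converge to the unregularized solution $\hat\Sigma^{-1}\hat\mu$, while a large $\lambda$ (above $\Vert\hat\mu\Vert_\infty$) forces $\tilde\alpha=0$.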

Estimate the functional $\theta$ by

$\tilde \theta = 2\hat\mu^T\tilde \alpha -\tilde \alpha^T\hat \Sigma \tilde \alpha\,,$

which is motivated by the de-biasing ideas for statistical inference in high-dimensional linear models. Here is the detailed explanation.

The idea is to construct a plug-in estimator $\tilde \theta = (\mu^*)^T\alpha^*$ for $\theta = \mu^T\alpha$. The authors set $\mu^*=\hat \mu$ and choose a de-biased version of $\tilde \alpha$ for $\alpha^*$. The KKT conditions for \eqref{3.1} imply that

$$\hat\mu-\hat\Sigma\tilde \alpha = \lambda \hat g\,,\label{3.3}$$

where $\hat g$ is a subgradient of $\Vert \beta\Vert_1$ at $\beta =\tilde \alpha$. It follows that

\begin{align*} \tilde \alpha &= \tilde \alpha + \Sigma^{-1}\hat \Sigma\tilde \alpha - \Sigma^{-1}\hat \Sigma\tilde \alpha\\ &= (I_p - \Sigma^{-1}\hat\Sigma)\tilde \alpha + \Sigma^{-1}\hat \Sigma\tilde \alpha\\ &= (I_p - \Sigma^{-1}\hat\Sigma)\tilde \alpha + \Sigma^{-1}\hat \mu - \lambda \Sigma^{-1}\hat g\\ &=(I_p - \Sigma^{-1}\hat\Sigma)(\tilde \alpha - \alpha + \alpha) + \Sigma^{-1}\hat \mu - \lambda \Sigma^{-1}\hat g\\ &= \alpha + \underbrace{\Sigma^{-1}(\hat \mu-\hat \Sigma\alpha)}_{\Delta_1} + \underbrace{(I_p - \Sigma^{-1}\hat\Sigma)(\tilde \alpha - \alpha)}_{\Delta_2} - \lambda \Sigma^{-1}\hat g\,. \end{align*}

The paper says that $\E\Delta_1\approx 0$ and that $\Delta_2$ is of smaller order. (Presumably $\E\Delta_1\approx 0$ because $\E\hat\mu=\mu$ and $\E\hat\Sigma\approx\Sigma$, so $\E[\hat\mu-\hat\Sigma\alpha]\approx \mu-\Sigma\alpha=0$; I am still not fully clear on this.) The major bias term of $\tilde \alpha$ is therefore $-\lambda\Sigma^{-1}\hat g$, which suggests the bias-corrected estimator

$\alpha^* = \tilde \alpha + \lambda \Sigma^{-1}\hat g = \tilde \alpha + \Sigma^{-1}(\hat \mu -\hat \Sigma\tilde \alpha)\,,$

and thus

$\tilde \theta = (\mu^*)^T\alpha^* = \hat \mu^T[\tilde \alpha + \Sigma^{-1}(\hat \mu -\hat\Sigma\tilde \alpha)]\approx 2\hat\mu^T\tilde \alpha - \tilde \alpha^T \hat \Sigma \tilde \alpha$

where we replace $\Sigma^{-1}\hat \mu$ by $\tilde \alpha$ so that $\tilde \theta$ is a legitimate estimator, i.e., computable from the data without the unknown $\Sigma^{-1}$.
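Given $\tilde\alpha$, the functional estimate is a one-liner. As a sanity check, plugging in the population quantities ($\hat\mu=\mu$, $\hat\Sigma=\Sigma$, $\tilde\alpha=\Sigma^{-1}\mu$) recovers $\theta$ exactly, since $2\mu^T\alpha-\alpha^T\Sigma\alpha=2\theta-\theta=\theta$ (function name and example inputs below are hypothetical):

```python
import numpy as np

def debiased_theta(mu_hat, Sigma_hat, alpha_tilde):
    """Bias-corrected functional estimate: 2 mu_hat' alpha - alpha' Sigma_hat alpha."""
    return 2 * mu_hat @ alpha_tilde - alpha_tilde @ Sigma_hat @ alpha_tilde

# Sanity check with made-up population quantities.
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu = np.array([1.0, 0.5])
alpha = np.linalg.solve(Sigma, mu)       # Sigma^{-1} mu
theta = mu @ alpha                       # mu' Sigma^{-1} mu
assert np.isclose(debiased_theta(mu, Sigma, alpha), theta)
```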

Set $\lambda =t\nu \sqrt{(1+\tau)\log p/n}$ and $\gamma = 2\sqrt{\tau/c_L}$. If $s\log p/n < \tilde c$, then for all $t > 1 \lor 2/c_2$, $$\sup_{(\mu,\Sigma)\in\cH(s,\tau)}\E \Vert \tilde \alpha - \alpha\Vert_2^2\le c_1\left( \frac{t^2(1+\tau)s\log p}{n} + \tau p^{-(c_2t-1)\land c} \right)\,.$$ Here, $c_1,c_2,c_3 > 0$ are constants possibly depending on $\nu, c_L, c_U$; $c$ is an arbitrary positive constant; $\tilde c\in (0, 1)$ is a constant depending on $c, \nu, c_L, c_U$, with $\tilde c\rightarrow 0$ as $c\rightarrow \infty$.

The paper notes that the two terms appearing in the above bound might not be optimally derived, and that the bound can be simplified to

$\sup_{(\mu,\Sigma)\in\cH(s,\tau)}\E \Vert \tilde \alpha - \alpha\Vert_2^2=O\left(\frac{s\log p}{n}\right)\,.$

Published in categories Memo