# Jackknife and Mutual Information

##### Posted on Jan 07, 2019

The jackknife is based on Wasserman (2006) and Efron and Hastie (2016), while the Jackknife estimation of Mutual Information is based on Zeng et al. (2018).

## Literatures

### Measurements of dependence between two random variables

• Pearson’s correlation coefficient
• kernel-based independence criterion that uses the squared Hilbert-Schmidt norm of the cross-covariance operator
• distance correlation (dCor)
• rank-based distance measure
• mutual information

### Typical groups of estimation for MI for continuous data

• “bins” method that discretizes the continuous data into different bins and estimates MI from the discretized data
• based on estimates of probability density functions: histogram estimator, kernel density estimator (KDE), B-spline estimator, mirrored KDE, ensemble estimator
• based on the relationship between the MI and entropies: kNN, KSG

#### Drawbacks

• the estimation of MI relies heavily on the choice of the tuning parameters involved such as the number of bins, the bandwidth in kernel density estimation, and the number of neighbors in the kNN estimator.
• little research has been done on the selection of these parameters.

so the paper focus on the KDE approach and claims that bandwidth should be set equal and proposes a jackknife version of MI (JMI).

## MI and Its Kernel Estimation

Consider two random variables $\X = (X_1,\ldots,X_p)’$ and $\Y=(Y_1,\ldots,Y_Q)’$.

properties:

• $MI(\X,\Y)\ge 0$ and equality holds iff the two variables are independent
• the stronger the dependence, the larger is the MI.
• invariant under strictly monotonic variable transformations.
• satisfies the so-called self-equitablility.

Let $S=\{(\x_i,\y_i),i=1,\ldots,n\}$ with $\x_i=(x_{i1},\ldots,x_{iP})’$ and $\y_i=(y_{i1},\ldots,y_{iQ})’$ be independent samples from $(\X,\Y)$. Consider the following multivariate KDEs:

Here $\H_\X$, $\H_\Y$, $\B_\X$ and $\B_\Y$ are diagonal bandwidth matrices, and $\K_\A^P(\x)=\vert \A\vert^{-1/2}\K^P(\A^{-1/2}\x)$. Based on these estimators, the KDE of MI is

The product of frequencies in hypercubes:

and

Taking $\K^P(\x)=\prod_{p=1}^P[\1_{(\vert x_p\vert \le 1)}/2]$, we have

and

Thus, the ratio of comparison is

The paper claims that $\B_\X=\H_\X$ and $\B_\Y=\H_\Y$ should be imposed on the KDE. Then we have

and

## The Jackknife

Let $T_n=T(X_1,\ldots,X_n)$ be an estimator of some quantity $\theta$ and let $\bias(T_n)=\E(T_n)-\theta$. Let $T_{(-i)}$ denote the statistic with the $i$-th observation removed. The jackknife bias estimate is defined by

where $\overline T_n=n^{-1}\sum_iT_{(-i)}$. The bias-corrected estimator is $T_{jack}=T_n-b_{jack}$.

The reason why $b_{jack}$ defined this way is as follows. For many statistics it can be shown that

for some $a$ and $b$. Then we have

Hence, by some calculations (refer to Wasserman (2006) for details),

which shows that $b_{jack}$ estimates the bias up to order $O(n^{-2})$. Similarly,

so the bias of $T_{jack}$ is an order of magnitude smaller than that of $T_n$. $T_{jack}$ can be written as

where $\widetilde T_i$ is also called pseudo-values.

The jackknife estimate of $\Var(T_n)$ is

where

is the sample variance of the pseudo-values. And we can get the jackknife estimate of standard error for $T_n$:

Standard Error of a statistic is the standard deviation of its sampling distribution. For the mean, its standard error is

where $\sigma$ is the standard deviation of the population, and $n$ is the size of the sample.

Efron and Hastie (2016) emphasized some features of the jackknife formula:

• It is nonparametric; no special form of the underlying distribution need be assumed.
• It is completely automatic: a single master algorithm can be written that inputs the data set $\x$ and the function $T(\x)$ for calculating the statistic, and outputs $\hat\se_{jack}$.
• The algorithm works with data sets of size $n-1$, not $n$.
• The jackknife standard error is upwardly biased as an estimate of the true standard error.
• The connection of the jackknife formula with Taylor series methods: $\se_{jack}^2$ is proportional to the sum of squared derivatives of $T(\x)$ in the $n$ component directions.

They also pointed out that the principal weakness of the jackknife is its dependence on local derivatives. Unsmooth statistic $T(X)$ can result in erratic behavior for $\hat\se_{jack}$.

## Jackknife Estimation of MI

Consider uniform transformation:

and

We have $MI(\U,\V)=MI(\X,\Y)$.

A copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform.

copula density function of $\U,\V$ and $(\U,\V)$: $c_\U(\u)$, $c_\V(\v)$ and $c_{\U\V}(\u,\v)$.

Jackknife estimation:

where

We estimate by the maximum of $\hat I_2(\B_\U,\B_\V,\H_\U,\H_\V)$.

The superiority of Jackknife?

Its theorem indicates that a kernel-based estimate of a copula density has a bias of the same order as the bandwidth. It is caused by the boundary points since the kernel density estimation has a much consistency rate for the interior points. And it shows that its bias depends on the difference of bandwidths.

Let $\B_\U=\H_\U=h^2\I_P$ and $\B_\V=\H_\U=h^2I_Q$, $\hat I_2$ reduced to $\hat I_3(h)$ and the finial estimator is

Test for independence can be carried out with permutation test.