WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Jackknife and Mutual Information

Posted on (Update: ) 0 Comments
Tags: Mutual Information, Jackknife, Copula

In this note, the material about Jackknife is based on Wasserman (2006) and Efron and Hastie (2016), while the Jackknife estimation of Mutual Information is based on Zeng et al. (2018).


Measurements of dependence between two random variables

  • Pearson’s correlation coefficient
  • kernel-based independence criterion that uses the squared Hilbert-Schmidt norm of the cross-covariance operator
  • distance correlation (dCor)
  • rank-based distance measure
  • mutual information

Typical groups of estimation for MI for continuous data

  • “bins” method that discretizes the continuous data into different bins and estimates MI from the discretized data
  • based on estimates of probability density functions: histogram estimator, kernel density estimator (KDE), B-spline estimator, mirrored KDE, ensemble estimator
  • based on the relationship between the MI and entropies: kNN, KSG


  • the estimation of MI relies heavily on the choice of the tuning parameters involved such as the number of bins, the bandwidth in kernel density estimation, and the number of neighbors in the kNN estimator.
  • little research has been done on the selection of these parameters.

so the paper focus on the KDE approach and claims that bandwidth should be set equal and proposes a jackknife version of MI (JMI).

MI and Its Kernel Estimation

Consider two random variables $\X = (X_1,\ldots,X_p)’$ and $\Y=(Y_1,\ldots,Y_Q)’$.



  • $MI(\X,\Y)\ge 0$ and equality holds iff the two variables are independent
  • the stronger the dependence, the larger is the MI.
  • invariant under strictly monotonic variable transformations.
  • satisfies the so-called self-equitablility.

Let $S=\{(\x_i,\y_i),i=1,\ldots,n\}$ with $\x_i=(x_{i1},\ldots,x_{iP})’$ and $\y_i=(y_{i1},\ldots,y_{iQ})’$ be independent samples from $(\X,\Y)$. Consider the following multivariate KDEs:

\[\begin{align} \hat f_{\X,\H_\X}(\x) &= \frac 1n \sum_{j=1}^n\K_{\H_X}^P(\x_j-\x)\\ \hat f_{\Y,\H_\Y}(\y) &= \frac 1n \sum_{j=1}^n\K_{\H_Y}^Q(\y_j-\y)\\ \hat f_{\X\Y,\B_\X,\B_\Y}(\x,\y)&=\frac 1n\sum_{j=1}^n\K_{\B_X}^P(\x_j-\x)\K_{\B_Y}^Q(\y_j-\y) \end{align}\]

Here $\H_\X$, $\H_\Y$, $\B_\X$ and $\B_\Y$ are diagonal bandwidth matrices, and $\K_\A^P(\x)=\vert \A\vert^{-1/2}\K^P(\A^{-1/2}\x)$. Based on these estimators, the KDE of MI is

\[\hat I_1(\B_\X,\B_\Y,\H_\X,\H_\Y) = \frac 1n\sum_{i=1}^n\log\frac{\hat f_{\X\Y,\B_\X,\B_\Y}(\x_i,\y_i)}{\hat f_{\X,\B_\X}(\x_i)\hat f_{\Y,\H_\Y}(\y_i)}\]

The product of frequencies in hypercubes:

\[\cD_\X = \{\x_i:\vert x_{ip}-x_p\vert < h_{X_p},p=1,2,\ldots,P\}\]


\[\cD_\Y = \{\y_i:\vert y_{iq}-y_q\vert < h_{Y_q}, q=1,2,\ldots,Q\}\]

Taking $\K^P(\x)=\prod_{p=1}^P[\1_{(\vert x_p\vert \le 1)}/2]$, we have

\[\begin{align*} \frac{\# \cD_X}{n} &= \frac{2^P}{n}\sum_{i=1}^n\prod_{p=1}^PK(\frac{x_{ip}-x_p}{h_{X_p}}) = 2^P\vert \H_\X\vert^{1/2}\hat f_{\X,\H_\X}(\x)\\ \frac{\# \cD_Y}{n} &= \frac{2^Q}{n}\sum_{i=1}^n\prod_{q=1}^QK(\frac{y_{iq}-y_q}{h_{Y_q}}) = 2^Q\vert \H_\Y\vert^{1/2}\hat f_{\Y,\H_\Y}(\y) \end{align*}\]


\[\frac{\#(\cD_X\cap\cD_Y)}{n} = \frac{2^{P+Q}}{n}\sum_{i=1}^n\prod_{p=1}^PK(\frac{x_{ip}-x_p}{h_{X_p}})\prod_{q=1}^QK(\frac{y_{iq}-y_q}{h_{Y_q}}) = 2^Q\vert \H_\Y\vert^{1/2}\hat f_{\Y,\H_\Y}(\y)\,.\]

Thus, the ratio of comparison is

\[\frac{\#(\cD_\X\cap\cD_\Y)/n}{\#\cD_X/n\#\cD_Y/n} =\frac{\hat f_{\X\Y,\H_\X,\H_\Y}(\x,\y)}{\hat f_{\X,\H_\X}(\x)\hat f_{\Y,\H_\Y}(\y)}\]

The paper claims that $\B_\X=\H_\X$ and $\B_\Y=\H_\Y$ should be imposed on the KDE. Then we have

\[\int \hat f_{\X\Y,\H_\X,\H_\Y}(\x,\y)d\x = \hat f_{\Y,\H_\Y}(\y)\]


\[\int \hat f_{\X\Y,\H_\X,\H_\Y}(\x,\y)d\y = \hat f_{\X,\H_\X}(\x)\]

The Jackknife

Let $T_n=T(X_1,\ldots,X_n)$ be an estimator of some quantity $\theta$ and let $\bias(T_n)=\E(T_n)-\theta$. Let $T_{(-i)}$ denote the statistic with the $i$-th observation removed. The jackknife bias estimate is defined by

\[b_{jack}=(n-1)(\overline T_n-T_n)\]

where $\overline T_n=n^{-1}\sum_iT_{(-i)}$. The bias-corrected estimator is $T_{jack}=T_n-b_{jack}$.

The reason why $b_{jack}$ defined this way is as follows. For many statistics it can be shown that

\[\bias(T_n) = \frac an+\frac{b}{n^2}+O(\frac{1}{n^3})\]

for some $a$ and $b$. Then we have

\[\bias(T_{(-i)}) = \frac{a}{n-1} + \frac{b}{(n-1)^2} + O(\frac{1}{n^3})\,.\]

Hence, by some calculations (refer to Wasserman (2006) for details),

\[\E(b_{jack}) = \bias(T_n) + O(\frac{1}{n^2})\,,\]

which shows that $b_{jack}$ estimates the bias up to order $O(n^{-2})$. Similarly,

\[\bias(T_{jack}) = -\frac{b}{n(n-1)} + O(\frac{1}{n^2}) = O(\frac{1}{n^2})\]

so the bias of $T_{jack}$ is an order of magnitude smaller than that of $T_n$. $T_{jack}$ can be written as

\[T_{jack} = \frac 1n\sum_{i=1}^n\Big( nT_n - (n-1)T_{(-i)} \Big) \triangleq \frac 1n\sum_{i=1}^n\widetilde T_i\,,\]

where $\widetilde T_i$ is also called pseudo-values.

The jackknife estimate of $\Var(T_n)$ is

\[v_{jack}=\frac{\tilde s^2}{n}\,,\]


\[\tilde s^2 = \frac{\sum_{i=1}^n(\widetilde T_i-\frac 1n\sum_{i=1}^n\widetilde{T}_i)^2}{n-1} = (n-1)\sum_{i=1}^n\Big[T_{(-i)}-\frac 1n\sum_{i=1}^nT_{(-i)}\Big]^2\]

is the sample variance of the pseudo-values. And we can get the jackknife estimate of standard error for $T_n$:

\[\hat\se_{jack} = \Big[\frac{n-1}{n}\sum_{i=1}^n\big(T_{(-i)}-\frac 1n\sum_{i=1}^nT_{(-i)}\big)^2\Big]^{1/2}\]

Standard Error of a statistic is the standard deviation of its sampling distribution. For the mean, its standard error is

\[\sigma_{\bar x} = \frac{\sigma}{\sqrt n}\]

where $\sigma$ is the standard deviation of the population, and $n$ is the size of the sample.

Efron and Hastie (2016) emphasized some features of the jackknife formula:

  • It is nonparametric; no special form of the underlying distribution need be assumed.
  • It is completely automatic: a single master algorithm can be written that inputs the data set $\x$ and the function $T(\x)$ for calculating the statistic, and outputs $\hat\se_{jack}$.
  • The algorithm works with data sets of size $n-1$, not $n$.
  • The jackknife standard error is upwardly biased as an estimate of the true standard error.
  • The connection of the jackknife formula with Taylor series methods: $\se_{jack}^2$ is proportional to the sum of squared derivatives of $T(\x)$ in the $n$ component directions.

They also pointed out that the principal weakness of the jackknife is its dependence on local derivatives. Unsmooth statistic $T(X)$ can result in erratic behavior for $\hat\se_{jack}$.

Jackknife Estimation of MI

Consider uniform transformation:

\[\U = (U_1,\ldots,U_P)' = (F_{X_1}(x_1),\ldots,F_{X_p}(X_p))'\]


\[\V = (V_1,\ldots,V_p)' = (F_{Y_1}(y_1),\ldots,F_{Y_p}(X_p))'\]

We have $MI(\U,\V)=MI(\X,\Y)$.

A copula is a multivariate probability distribution for which the marginal probability distribution of each variable is uniform.

copula density function of $\U,\V$ and $(\U,\V)$: $c_\U(\u)$, $c_\V(\v)$ and $c_{\U\V}(\u,\v)$.

There is an intuitive and visual guide to copulas, and an interesting Chinese blog.

简单来说,它可以看作是一种连接函数,即把随机变量的联合分布与其各自的边缘分布连接起来。用一个简单的式子表达,就是 $F_{XY}(x,y)=C(F_X(x),F_Y(y))$,其中的 $C( , )$ 就是传说中的 Copula。

\(\begin{align*} P(U\le u, V\le v)&=P(F_X(X)\le u, F_Y(Y)\le v)=P(X\le F_X^{-1}(u),Y\le F_Y^{-1}(v))\\ &=F_{XY}(F_X^{-1}(u),F_Y^{-1}(v))=C(u,v) \end{align*}\) which implies that the Copula function is exactly the joint distribution of $U$ and $V$, where $U\sim U[0,1]$ and $V\sim [0,1]$.

Jackknife estimation:

\[\hat I_2(\B_\U,\B_\V,\H_\U,\H_\V) = \frac{1}{n}\sum_{i=1}^n\log\frac{\hat c_{\U\V,\B_\U,\B_\V}^{\\i}(\u_i^*,\v_i^*)}{\hat c_{\U,\H_\U}^{\\i}(\u_i^*)\hat c_{\V,\H_\V}^{\\i}(\v_i^*)}\]


\[\hat c_{\U,\H_\U}^{\\i}(\u) = \frac{1}{n-1}\sum_{j\neq i}\K_{\H_\U}^P(\u_j^*-\u)\]

We estimate by the maximum of $\hat I_2(\B_\U,\B_\V,\H_\U,\H_\V)$.

The superiority of Jackknife?

Its theorem indicates that a kernel-based estimate of a copula density has a bias of the same order as the bandwidth. It is caused by the boundary points since the kernel density estimation has a much consistency rate for the interior points. And it shows that its bias depends on the difference of bandwidths.

Let $\B_\U=\H_\U=h^2\I_P$ and $\B_\V=\H_\U=h^2I_Q$, $\hat I_2$ reduced to $\hat I_3(h)$ and the finial estimator is

\[\widehat{JMI}(\X,\Y)=\underset{h>0}{\max}[\hat I_3(h)]\,.\]

Test for independence can be carried out with permutation test.


Published in categories Note