MINE
These are my notes from reading the paper Detecting Novel Associations in Large Data Sets.
Abstract
Identifying interesting relationships between pairs of variables in large data sets
MIC
 captures a wide range of associations, both functional and not
 for functional relationships, provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function
explore a large data set
search for pairs of variables that are closely associated
 calculate some measure of dependence for each pair
 rank the pairs by their scores
 examine the top-scoring pairs
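The workflow above (score every pair, rank, inspect the top) can be sketched in a few lines. This is only an illustration of the loop, not anything from the paper: `rank_pairs` and `squared_pearson` are hypothetical names, the data is synthetic, and squared Pearson correlation stands in as a placeholder where MIC would be plugged in.

```python
from itertools import combinations

import numpy as np


def rank_pairs(data, score):
    """Score every pair of variables in `data` and sort from strongest to weakest."""
    results = []
    for a, b in combinations(sorted(data), 2):
        results.append((a, b, score(data[a], data[b])))
    return sorted(results, key=lambda r: r[2], reverse=True)


def squared_pearson(x, y):
    # Placeholder dependence measure (assumption); MIC would go here instead.
    return float(np.corrcoef(x, y)[0, 1] ** 2)


# Synthetic example data (assumption): one linear pair plus an unrelated variable.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
data = {
    "x": x,
    "linear": 2 * x + rng.normal(0, 0.1, 500),
    "noise": rng.normal(0, 1, 500),
}

ranking = rank_pairs(data, squared_pearson)
for a, b, s in ranking:
    print(f"{a:>6} vs {b:<6} score={s:.3f}")
```

Any function that takes two arrays and returns a number can be passed as `score`, which is what makes the choice of statistic the interesting part.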
the statistic we use to measure dependence should have two heuristic properties:
generality

with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships.

not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function
equitability.
the statistic should give similar scores to equally noisy relationships of different types.
An equitable statistic should give similar scores to functional relationships with similar $R^2$ values
MIC
 establish MIC’s generality through proofs, show its equitability on functional relationships through simulations
 observe that this translates into intuitively equitable behavior on more general associations
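MIC is based on the idea that if a relationship exists between two variables, a grid drawn on their scatterplot can partition the data to encapsulate that relationship.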
MINE
MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration.
 identify interesting associations
 characterize them according to properties such as nonlinearity and monotonicity
MIC
if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship.
 explore all grids up to a maximal grid resolution, dependent on the sample size
 compute, for every pair of integers $(x,y)$, the largest possible mutual information achievable by any $x$-by-$y$ grid applied to the data.
 normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1.
characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any $x$-by-$y$ grid
MIC: the maximum value in $M$
$G$: a grid
$I_G$: the mutual information of the probability distribution induced on the boxes of $G$, where the probability of a box is proportional to the number of data points falling inside it
$m_{x,y}=\frac{\max_{G} I_G}{\log \min\{x,y\}}$, where the maximum is taken over all $x$-by-$y$ grids $G$
\[\mathrm{MIC}=\max_{xy<B} m_{x,y}\] where $B$ is the maximal grid resolution, which depends on the sample size
$m_{x,y}\in [0,1]$, so $\mathrm{MIC}\in [0,1]$
 $\mathrm{MIC}(X,Y)=\mathrm{MIC}(Y,X)$
dynamic programming(tree?)
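To make the definitions above concrete, here is a rough sketch of my own. It is not the paper's algorithm: instead of searching over all grids with dynamic programming, it only tries equal-frequency grids (each row and column holds about the same number of points), so each $m_{x,y}$ it finds is a lower bound on the true entry. The function names are hypothetical, the bound $B = n^{0.6}$ is an assumed setting, and the binning assumes continuous data without ties.

```python
import numpy as np


def grid_mutual_information(rank_x, rank_y, nx, ny):
    """Mutual information (bits) of the distribution an nx-by-ny grid induces.

    The grid is equal-frequency: points are binned by rank, so every column
    (and every row) holds roughly n/nx (n/ny) points. This is a simplification.
    """
    n = len(rank_x)
    bx = rank_x * nx // n              # column index of each point, 0..nx-1
    by = rank_y * ny // n              # row index of each point, 0..ny-1
    counts = np.zeros((nx, ny))
    np.add.at(counts, (bx, by), 1)     # count points per box
    p = counts / n                     # box probability is proportional to its count
    px = p.sum(axis=1, keepdims=True)  # marginal over columns
    py = p.sum(axis=0, keepdims=True)  # marginal over rows
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px * py)[nz])))


def mic_equal_frequency(x, y, alpha=0.6):
    """Max of normalized MI over equal-frequency grids with nx * ny < B = n**alpha."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    B = len(rank_x) ** alpha           # assumed resolution bound
    best = 0.0
    for nx in range(2, int(B)):
        for ny in range(2, int(B)):
            if nx * ny >= B:
                break
            mi = grid_mutual_information(rank_x, rank_y, nx, ny)
            best = max(best, mi / np.log2(min(nx, ny)))
    return float(best)


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)
line_score = mic_equal_frequency(x, 2 * x + 1)
parabola_score = mic_equal_frequency(x, x ** 2)
noise_score = mic_equal_frequency(x, rng.uniform(-1, 1, 500))
print(line_score, parabola_score, noise_score)
```

Because the grids are not optimized, this version should underestimate MIC on relationships where uneven bin widths matter; the dynamic programming step in the paper exists precisely to search over grid placements.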
Main properties of MIC
With probability approaching 1 as the sample size grows:
 MIC assigns scores that tend to 1 to all never-constant noiseless functional relationships
 MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships)
 MIC assigns scores that tend to 0 to statistically independent variables.
tested MIC’s equitability through simulations.
 noiseless functional relationships (i.e., $R^2 = 1.0$) receive MIC scores approaching 1.0
 for a large collection of test functions with varied sample sizes, noise levels, and noise models, MIC roughly equals the coefficient of determination $R^2$ relative to each respective noiseless function.
 (???)at reasonable sample sizes, a sinusoidal relationship with a noise level of $R^2 = 0.80$ and a linear relationship with the same $R^2$ value receive nearly the same MIC score.
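To see what "the same $R^2$ relative to the noiseless function" means in practice, here is a small sketch of how one could generate a linear and a sinusoidal relationship at a matched noise level. It is my own illustration, not the paper's simulation setup: the helper names, the Gaussian noise model, and the specific test functions are all assumptions.

```python
import numpy as np


def add_noise_for_r2(f_values, target_r2, rng):
    """Add Gaussian noise sized so that R^2 relative to f is about target_r2."""
    # R^2 = Var(f) / (Var(f) + sigma^2)  =>  sigma^2 = Var(f) * (1 - R^2) / R^2
    sigma = np.sqrt(np.var(f_values) * (1 - target_r2) / target_r2)
    return f_values + rng.normal(0, sigma, size=f_values.shape)


def r_squared(y, f_values):
    """Coefficient of determination of y relative to the noiseless values."""
    ss_res = np.sum((y - f_values) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)


rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 5000)
linear = x                       # noiseless linear relationship
sine = np.sin(4 * np.pi * x)     # noiseless sinusoidal relationship

r2_linear = r_squared(add_noise_for_r2(linear, 0.8, rng), linear)
r2_sine = r_squared(add_noise_for_r2(sine, 0.8, rng), sine)
print(r2_linear, r2_sine)
```

An equitable statistic should then give these two noisy data sets similar scores, since both sit at roughly $R^2 = 0.80$ relative to their own noiseless function.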
Comparisons to other methods
An expanded toolkit for exploration
 MAS
 $\mathrm{MIC}-\rho^2$ (where $\rho$ is the Pearson correlation coefficient) is near 0 for linear relationships and large for nonlinear relationships with high values of MIC.
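A quick worked example of my own (assuming noiseless data, a large sample, and $x$ distributed symmetrically about 0) shows why this difference behaves as a nonlinearity measure. For a line, both terms tend to 1; for a parabola, $\mathrm{Cov}(x, x^2) = E[x^3] = 0$, so $\rho = 0$ while MIC still tends to 1:

\[y = 2x:\quad \rho^2 = 1,\ \mathrm{MIC}\to 1 \ \Rightarrow\ \mathrm{MIC}-\rho^2 \to 0\]
\[y = x^2:\quad \rho^2 = 0,\ \mathrm{MIC}\to 1 \ \Rightarrow\ \mathrm{MIC}-\rho^2 \to 1\]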
Application of MINE to real data sets.
 social, economic, health, and political indicators from the World Health Organization (WHO) and its partners
 yeast gene expression profiles from a classic paper reporting genes whose transcript levels vary periodically with the cell cycle
 performance statistics from the 2008 Major League Baseball (MLB) season
non-coexistence: when one species is abundant, the other is less abundant than expected by chance, and vice versa
notes
encapsulate accentuated axiomatic superposition gut ribosomal distal parabolic plausibility canonical supervene decentralized