MINE

Posted on Mar 17, 2017

Tags: Bioinformatics

These are my notes on the paper Detecting Novel Associations in Large Data Sets.

Abstract

Identifying interesting relationships between pairs of variables in large data sets

MIC

captures a wide range of associations both functional and not
for functional relationships provides a score that roughly equals the coefficient of determination of the data relative to the regression function

explore a large data set

search for pairs of variables that are closely associated

calculate some measure of dependence for each pair
rank the pairs by their scores
examine the top-scoring pairs
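The three steps above can be sketched in a few lines. This is only an illustration: `rank_pairs` and `squared_pearson` are names I made up, and squared Pearson correlation is a placeholder score (it only detects linear association), standing in for whichever dependence measure is actually used.

```python
from itertools import combinations

import numpy as np


def rank_pairs(data, score):
    """Score every pair of variables and sort the pairs, highest score first."""
    names = list(data)
    scored = [(score(data[a], data[b]), a, b) for a, b in combinations(names, 2)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


def squared_pearson(x, y):
    """Placeholder dependence measure: squared Pearson correlation."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


# Toy data set (hypothetical): one strongly related pair and one unrelated variable.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = {
    "x": x,
    "linear": 2 * x + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
}

ranking = rank_pairs(data, squared_pearson)
for value, a, b in ranking:
    print(f"{a:>6} vs {b:<6} score = {value:.3f}")
```

Swapping `squared_pearson` for a more general statistic changes which pairs rise to the top, which is the point of asking for generality and equitability.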

the statistic we use to measure dependence should have two heuristic properties:

generality

with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships.
not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function

equitability.

the statistic should give similar scores to equally noisy relationships of different types.

An equitable statistic should give similar scores to functional relationships with similar $R^2$ values

MIC

establish MIC’s generality through proofs, show its equitability on functional relationships through simulations
observe that this translates into intuitively equitable behavior on more general associations

MIC is based on the idea that a relationship between two variables can be encapsulated by a grid drawn on their scatterplot.

MINE

MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration.

identify interesting associations
characterize them according to properties such as nonlinearity and monotonicity

MIC

if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship.

explore all grids up to a maximal grid resolution, dependent on the sample size
compute, for every pair of integers (x,y), the largest possible mutual information achievable by any x-by-y grid applied to the data.
normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1.

characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any x-by-y grid

MIC: the maximum value in $M$

$G$: a grid

$I_G$: the mutual information of the probability distribution induced on the boxes of $G$, where the probability of a box is proportional to the number of data points falling inside the box
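For reference, the mutual information of a grid can be written out explicitly. The notation here is my own, not the paper's: $p(i,j)$ is the fraction of data points in box $(i,j)$ of an $x$-by-$y$ grid, and $p_X(i)$, $p_Y(j)$ are the column and row marginals.

\[I_G=\sum_{i=1}^{x}\sum_{j=1}^{y}p(i,j)\log\frac{p(i,j)}{p_X(i)\,p_Y(j)}\]

Mutual information is bounded by the smaller of the two marginal entropies, and a variable with $k$ possible values has entropy at most $\log k$, so

\[0\le I_G\le \log\min\{x,y\}\]

This is why dividing by $\log\min\{x,y\}$ gives a value between 0 and 1.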

$m_{x,y}=\frac{\max_G I_G}{\log \min\{x,y\}}$, where the maximum is taken over all x-by-y grids $G$

\[\mathrm{MIC}=\max_{xy<B} m_{x,y}\]

where $B$ is the maximal grid resolution, a function of the sample size $n$ (the paper uses $B=n^{0.6}$)

$m_{x,y}\in [0,1]$, so $\mathrm{MIC}\in [0,1]$
$\mathrm{MIC}(X,Y)=\mathrm{MIC}(Y,X)$

dynamic programming(tree?)
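To make the definition concrete, here is a simplified sketch. It is not the paper's algorithm: instead of searching for the best grid of each size, it only tries equal-frequency grids, so it can underestimate the true maximum. The function names are my own, and the exponent `alpha=0.6` is the grid-size limit as I recall it from the paper, so treat it as an assumption.

```python
import numpy as np


def grid_mutual_information(x, y, nx, ny):
    """Mutual information (in nats) of an nx-by-ny equal-frequency grid."""
    n = len(x)
    # Rank-based binning: every column (row) holds about n/nx (n/ny) points.
    # Ties are broken arbitrarily by argsort.
    col = np.argsort(np.argsort(x)) * nx // n
    row = np.argsort(np.argsort(y)) * ny // n
    counts = np.zeros((nx, ny))
    np.add.at(counts, (col, row), 1)
    p = counts / n
    # Product of the marginals, i.e. the box probabilities under independence.
    expected = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / expected[nz])))


def approx_mic(x, y, alpha=0.6):
    """Largest normalized mutual information over equal-frequency grids with nx*ny < n**alpha."""
    x, y = np.asarray(x), np.asarray(y)
    budget = len(x) ** alpha
    best = 0.0
    for nx in range(2, int(budget) + 1):
        for ny in range(2, int(budget) + 1):
            if nx * ny >= budget:
                break
            mi = grid_mutual_information(x, y, nx, ny)
            best = max(best, mi / np.log(min(nx, ny)))
    return best


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
mic_linear = approx_mic(x, x)
mic_parabola = approx_mic(x, x ** 2)
mic_independent = approx_mic(x, rng.uniform(-1, 1, size=500))
print(mic_linear, mic_parabola, mic_independent)
```

The score is symmetric in its arguments up to tie-breaking, since swapping `x` and `y` only transposes each grid. The ratio does not depend on the base of the logarithm, because the same base appears in the numerator and the denominator.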

Main properties of MIC

with probability approaching 1 as sample size grows:

MIC assigns scores that tend to 1 to all never-constant noiseless functional relationships
MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships)
MIC assigns scores that tend to 0 to statistically independent variables.

tested MIC’s equitability through simulations.

noiseless functional relationships (i.e., $R^2 = 1.0$) receive MIC scores approaching 1.0
for a large collection of test functions with varied sample sizes, noise levels, and noise models, MIC roughly equals the coefficient of determination $R^2$ relative to each respective noiseless function.
(???) at reasonable sample sizes, a noisy sinusoidal relationship with $R^2 = 0.80$ and a linear relationship with the same $R^2$ value receive nearly the same MIC score.
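To see what "the same $R^2$ relative to the noiseless function" means for two different shapes, the sketch below adds noise to a linear and a sinusoidal function so that both end up near $R^2 = 0.80$. Additive Gaussian noise is my assumption here (the paper uses several noise models), and the helper names are hypothetical.

```python
import numpy as np


def add_noise_for_r2(f_values, target_r2, rng):
    """Add Gaussian noise so that R^2 relative to f is about target_r2.

    With y = f + noise, R^2 is roughly Var(f) / (Var(f) + Var(noise)),
    which gives Var(noise) = Var(f) * (1 - R^2) / R^2.
    """
    noise_sd = np.sqrt(np.var(f_values) * (1 - target_r2) / target_r2)
    return f_values + rng.normal(scale=noise_sd, size=f_values.shape)


def r2_relative_to(f_values, y):
    """Coefficient of determination of y relative to the noiseless values f."""
    ss_res = np.sum((y - f_values) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=5000)
linear = x
sinusoid = np.sin(4 * np.pi * x)

y_lin = add_noise_for_r2(linear, 0.8, rng)
y_sin = add_noise_for_r2(sinusoid, 0.8, rng)

r2_lin = r2_relative_to(linear, y_lin)
r2_sin = r2_relative_to(sinusoid, y_sin)
print(r2_lin, r2_sin)
```

An equitable statistic would give `(x, y_lin)` and `(x, y_sin)` similar scores, because both are equally noisy in this sense even though their shapes differ.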

Comparisons to other methods

An expanded toolkit for exploration

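MAS (maximum asymmetry score): measures deviation from monotonicity
$\mathrm{MIC}-\rho^2$ (with $\rho$ the Pearson correlation coefficient) measures nonlinearity: it is near 0 for linear relationships and large for nonlinear relationships with high values of MIC.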
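Two noiseless limiting cases illustrate the nonlinearity measure (a worked example of my own, not taken from the paper). For $Y = X$, $\rho^2 = 1$ and MIC tends to 1, so

\[\mathrm{MIC}-\rho^2\to 1-1=0\]

For $Y = X^2$ with $X$ distributed symmetrically around 0, the covariance vanishes while MIC still tends to 1:

\[\mathrm{Cov}(X,X^2)=E[X^3]-E[X]\,E[X^2]=0\;\Rightarrow\;\rho=0,\qquad \mathrm{MIC}-\rho^2\to 1-0=1\]
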

Application of MINE to real data sets.

social, economic, health, and political indicators from the World Health Organization (WHO) and its partners
yeast gene expression profiles from a classic paper reporting genes whose transcript levels vary periodically with the cell cycle
performance statistics from the 2008 Major League Baseball (MLB) season

noncoexistence: when one species is abundant, the other is less abundant than expected by chance, and vice versa

notes

encapsulate, accentuated, axiomatic, superposition, gut, ribosomal, distal, parabolic, plausibility, canonical, supervene, decentralized

Published in categories Note


WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.
