WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.


Tags: Bioinformatics

These are my notes from reading the paper Detecting Novel Associations in Large Data Sets.


Identifying interesting relationships between pairs of variables in large data sets


the maximal information coefficient (MIC):

  1. captures a wide range of associations, both functional and not
  2. for functional relationships, provides a score that roughly equals the coefficient of determination ($R^2$) of the data relative to the regression function

explore a large data set

search for pairs of variables that are closely associated

  1. calculate some measure of dependence for each pair
  2. rank the pairs by their scores
  3. examine the top-scoring pairs
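
A minimal sketch of this rank-and-inspect loop in Python. The names `rank_pairs` and `pearson_r2` are my own, and squared Pearson correlation is only a stand-in score here; any dependence measure with the same call shape could be plugged in instead.

```python
from itertools import combinations

import numpy as np


def pearson_r2(x, y):
    """Stand-in dependence score: squared Pearson correlation."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def rank_pairs(data, score=pearson_r2):
    """Score every pair of variables and sort from strongest to weakest.

    data  : dict mapping a variable name to a 1-D array (equal lengths)
    score : callable (x, y) -> float, larger meaning more dependent
    """
    scored = [
        ((a, b), score(data[a], data[b]))
        for a, b in combinations(sorted(data), 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `rank_pairs(data)[:10]` would then give the ten top-scoring pairs to examine by eye.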

the statistic we use to measure dependence should have two heuristic properties:

  1. generality: with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships. The latter matters because not only do relationships take many functional forms, but many important relationships (for example, a superposition of functions) are not well modeled by a function.

  2. equitability: the statistic should give similar scores to equally noisy relationships of different types.

An equitable statistic should give similar scores to functional relationships with similar $R^2$ values


  1. establish MIC’s generality through proofs, show its equitability on functional relationships through simulations
  2. observe that this translates into intuitively equitable behavior on more general associations

MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used to

  1. identify interesting associations
  2. characterize them according to properties such as nonlinearity and monotonicity

MIC is based on the idea that if a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. To calculate the MIC of a set of two-variable data:

  1. explore all grids up to a maximal grid resolution, dependent on the sample size
  2. compute for every pair of integers (x,y) the largest possible mutual information achievable by any x-by-y grid applied to the data.
  3. normalize these mutual information values to ensure a fair comparison between grids of different dimensions and to obtain modified values between 0 and 1.
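
The three steps above can be sketched in Python as follows. This is a simplification, not the paper's algorithm: for each resolution it tries only the single grid with equal-frequency bins on both axes, so the result is at best a lower bound on the true MIC. The function names are my own, and the resolution limit `x * y <= n ** 0.6` is an assumption standing in for "maximal grid resolution, dependent on the sample size". Tied values are split between bins arbitrarily.

```python
import numpy as np


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding roughly equal counts."""
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return (ranks * n_bins) // len(values)


def grid_mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information (in nats) of the distribution induced on the grid boxes."""
    counts = np.zeros((nx, ny))
    np.add.at(counts, (x_bins, y_bins), 1)      # count points per box
    p = counts / counts.sum()                   # box probabilities
    px = p.sum(axis=1, keepdims=True)           # marginal over x bins
    py = p.sum(axis=0, keepdims=True)           # marginal over y bins
    nonzero = p > 0
    return float(np.sum(p[nonzero] * np.log(p[nonzero] / (px @ py)[nonzero])))


def approx_mic(x, y, alpha=0.6):
    """Largest normalized mutual information over equal-frequency grids."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    max_cells = int(len(x) ** alpha)            # assumed resolution limit
    best = 0.0
    for nx in range(2, max_cells // 2 + 1):
        x_bins = equal_frequency_bins(x, nx)
        for ny in range(2, max_cells // nx + 1):
            y_bins = equal_frequency_bins(y, ny)
            mi = grid_mutual_information(x_bins, y_bins, nx, ny)
            best = max(best, mi / np.log(min(nx, ny)))   # normalize
    return best
```

Searching over where the grid lines go, rather than fixing them at quantiles, is what makes the real computation expensive.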

characteristic matrix $M = (m_{x,y})$, where $m_{x,y}$ is the highest normalized mutual information achieved by any x-by-y grid

MIC: the maximum value in $M$

$G$: a grid

$I_G$: the mutual information of the probability distribution induced on the boxes of $G$, where the probability of a box is proportional to the number of data points falling inside the box

$m_{x,y}=\frac{\max I_G}{\log \min(x,y)}$, where the maximum is taken over all x-by-y grids $G$

  1. $m_{x,y}\in [0,1]$, so $\mathrm{MIC}\in [0,1]$
  2. $\mathrm{MIC}(X,Y)=\mathrm{MIC}(Y,X)$
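
Both properties can be checked from the standard definition of mutual information. Below, $p(i,j)$ is the fraction of the $n$ data points in the box at column $i$ and row $j$, and $H$ denotes entropy; this notation is mine, not the paper's.

$$
I_G=\sum_{i=1}^{x}\sum_{j=1}^{y}p(i,j)\log\frac{p(i,j)}{p(i)\,p(j)},\qquad p(i)=\sum_{j}p(i,j),\quad p(j)=\sum_{i}p(i,j)
$$

$$
0\le I_G\le \min\bigl(H(\text{columns}),H(\text{rows})\bigr)\le \min(\log x,\log y)=\log\min(x,y)
$$

Dividing by $\log\min(x,y)$ therefore keeps every entry of $M$ between 0 and 1. Symmetry holds because $I_G$ does not change when the two axes are swapped, and neither does $\min(x,y)$.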

dynamic programming(tree?)

Main properties of MIC

with probability approaching 1 as sample size grows:

  1. MIC assigns scores that tend to 1 to all never-constant noiseless functional relationships
  2. MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including super-positions of noiseless functional relationships)
  3. MIC assigns scores that tend to 0 to statistically independent variables.

tested MIC’s equitability through simulations.

  1. noiseless functional relationships (i.e., $R^ 2 = 1.0$) receive MIC scores approaching 1.0
  2. for a large collection of test functions with varied sample sizes, noise levels, and noise models, MIC roughly equals the coefficient of determination $R^2$ relative to each respective noiseless function.
  3. (???)at reasonable sample sizes, a sinusoidal relationship with a noise level of $R^2 = 0.80$ and a linear relationship with the same $R^2$ value receive nearly the same MIC score.
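
To make "a noise level of $R^2 = 0.80$" concrete, here is a small sketch of how such test data could be generated. It assumes additive Gaussian noise and $R^2 = 1 - SS_{res}/SS_{tot}$ measured against the noiseless function; the paper varies the noise model, so this is only one possible setup, and the function names are my own.

```python
import numpy as np


def add_noise_for_r2(f_values, target_r2, rng):
    """Add Gaussian noise so that R^2 against f_values is about target_r2.

    Solves Var(f) / (Var(f) + sigma^2) = target_r2 for the noise variance.
    """
    f_values = np.asarray(f_values, dtype=float)
    noise_var = f_values.var() * (1.0 - target_r2) / target_r2
    return f_values + rng.normal(0.0, np.sqrt(noise_var), size=f_values.shape)


def r2_relative_to(f_values, y):
    """Coefficient of determination of y relative to the noiseless values."""
    ss_res = np.sum((y - f_values) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=5000)
linear = x
sinusoid = np.sin(4 * np.pi * x)
noisy_linear = add_noise_for_r2(linear, 0.80, rng)
noisy_sinusoid = add_noise_for_r2(sinusoid, 0.80, rng)
```

An equitable statistic should score `(x, noisy_linear)` and `(x, noisy_sinusoid)` about the same, since both are equally noisy in this $R^2$ sense.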

Comparisons to other methods

An expanded toolkit for exploration

  1. MAS (maximum asymmetry score): captures deviation from monotonicity
  2. $\mathrm{MIC}-\rho^2$ is near 0 for linear relationships and large for nonlinear relationships with high values of MIC.
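
A tiny sketch of the $\mathrm{MIC}-\rho^2$ measure. The MIC score is passed in as a number computed elsewhere; `nonlinearity_score` is a name I made up for illustration.

```python
import numpy as np


def nonlinearity_score(mic_value, x, y):
    """MIC minus squared Pearson correlation.

    mic_value : MIC of (x, y), computed by whatever MIC routine is at hand
    """
    rho = np.corrcoef(x, y)[0, 1]
    return float(mic_value - rho ** 2)
```

For a noiseless parabola on a symmetric interval, $\rho$ is about 0 while MIC tends to 1, so the score is close to 1. For a noiseless line, $\rho^2 = 1$ and the score is close to 0.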

Application of MINE to real data sets.

  1. social, economic, health, and political indicators from the World Health Organization (WHO) and its partners
  2. yeast gene expression profiles from a classic paper reporting genes whose transcript levels vary periodically with the cell cycle
  3. performance statistics from the 2008 Major League Baseball (MLB) season

noncoexistence: when one species is abundant, the other is less abundant than expected by chance, and vice versa


encapsulate accentuated axiomatic superposition gut ribosomal distal parabolic plausibility canonical supervene decentralized

Published in categories Note