WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

LD Score Regression

Tags: Polygenicity, Genome-wide Association Studies, Linkage Disequilibrium

This note is for Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., Daly, M. J., Price, A. L., & Neale, B. M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47(3), 291–295.

An inflated distribution of test statistics in GWAS can arise from

  • polygenicity (many small genetic effects)
  • confounding biases: such as cryptic relatedness and population stratification

The paper proposes LD Score regression:

  • it quantifies the contribution of each source by examining the relationship between test statistics and linkage disequilibrium (LD)
  • its intercept can be used to estimate a correction factor that is more powerful and accurate than genomic control

Under a polygenic model, where effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is

\[E[\chi^2\mid \ell_j] = Nh^2\ell_j/M + Na +1\,,\]

where

  • $N$: sample size
  • $M$: number of SNPs, so that $h^2/M$ is the average heritability explained per SNP
  • $a$: contribution of confounding biases
  • $\ell_j = \sum_kr_{jk}^2$: LD score of variant $j$, which measures the amount of genetic variation tagged by $j$

Consequently, if we regress the $\chi^2$ statistics from a GWAS against the LD Scores, the intercept minus one estimates the mean contribution of confounding bias to the inflation in the test statistics, and the slope estimates $Nh^2/M$.
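To make the regression concrete, here is a minimal sketch on synthetic data. It assumes, purely for illustration, that each $\chi^2$ statistic is a scaled chi-square(1) draw whose mean follows the formula above, and it uses an unweighted least-squares fit via `np.polyfit`; the paper's estimator uses regression weights, which this sketch leaves out. The values of `N`, `M`, `h2`, `a` and the LD Score range are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical study parameters (not taken from the paper).
N = 10_000      # sample size
M = 100_000     # number of SNPs
h2 = 0.5        # heritability
a = 5e-5        # confounding term, so N * a = 0.5

# Synthetic LD Scores and chi-square statistics with the model's mean:
# E[chi2 | ell] = N * h2 * ell / M + N * a + 1.
ell = rng.uniform(1.0, 200.0, size=M)
expected = N * h2 * ell / M + N * a + 1.0
chi2 = expected * rng.chisquare(df=1, size=M)

# Unweighted regression of chi2 on LD Score.
slope, intercept = np.polyfit(ell, chi2, deg=1)

h2_hat = slope * M / N              # slope estimates N * h2 / M
confounding_hat = intercept - 1.0   # intercept minus one estimates N * a

print(f"h2 estimate: {h2_hat:.3f}, confounding estimate: {confounding_hat:.3f}")
```

With these settings the two estimates should land near `h2` and `N * a`, up to sampling noise.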

Model

Model phenotypes as

\[\phi = X\beta + \epsilon\]

where

  • $\phi$: $N\times 1$ vector of (quantitative) phenotypes
  • $X$: an $N\times M$ matrix of genotypes normalized to mean zero and variance one ($\sum_{i=1}^NX_{ij} =0, \frac 1N X_j^TX_j=1$)
  • $\beta$: per-normalized-genotype effect sizes
  • $\bbE[\epsilon]=0, \Var[\epsilon]=(1-h_g^2)I, \bbE[\beta]=0, \Var[\beta] = (h_g^2/M)I$
  • assume genotype at variant $j$ for individual $i$ is independent of other individuals’ genotypes
  • incorporate linkage disequilibrium: define $r_{jk} = \bbE[X_{ij}X_{ik}]$, which does not depend on $i$
  • assume $X, \beta$ and $\epsilon$ are mutually independent

For each variant $j=1,\ldots,M$, compute least-squares estimates of effect size

\[\hat\beta_j=(X_j^TX_j)^{-1}(X_j^T\phi)=(X_j^T\phi)/N\]

and $\chi_j^2$-statistics $\chi_j^2=N\hat\beta_j^2$.
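The marginal estimates and $\chi^2$ statistics can be computed for all variants at once. The sketch below simulates data from the model under assumed settings (independent variants, allele frequency 0.3, Gaussian $\beta$ and $\epsilon$, none of which are prescribed by the note) and then applies the two formulas above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, h2 = 1000, 200, 0.5   # hypothetical sizes and heritability

# Additively coded genotypes (0/1/2), then column-wise normalization
# to mean zero and variance one (np.std uses the 1/N convention).
G = rng.binomial(2, 0.3, size=(N, M)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Phenotype phi = X beta + eps with Var[beta] = h2/M and Var[eps] = 1 - h2.
beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)
eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=N)
phi = X @ beta + eps

# Marginal least-squares estimates and chi-square statistics.
beta_hat = X.T @ phi / N        # uses X_j^T X_j = N
chi2 = N * beta_hat ** 2
```

Because every column satisfies $X_j^TX_j=N$, the single matrix product `X.T @ phi / N` agrees with fitting each variant separately.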

Define the LD Score of variant $j$ as

\[\ell_j = \sum_{k=1}^M r_{jk}^2\,.\]
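As an illustration, the sum can be evaluated with sample correlations standing in for the population $r_{jk}$. This is only a sketch: the helper names `standardize` and `ld_scores` are hypothetical, and the genotypes are simulated, with one variant duplicated so that the pair is in perfect LD.

```python
import numpy as np

def standardize(G):
    """Center each column and scale it to variance one (1/N convention)."""
    G = np.asarray(G, dtype=float)
    return (G - G.mean(axis=0)) / G.std(axis=0)

def ld_scores(X):
    """Row sums of the squared sample correlation matrix of the columns of X."""
    N = X.shape[0]
    R = X.T @ X / N              # sample correlations between variants
    return (R ** 2).sum(axis=1)

rng = np.random.default_rng(0)
N, M = 500, 50
G = rng.binomial(2, 0.3, size=(N, M)).astype(float)
G[:, 1] = G[:, 0]                # variants 0 and 1 are in perfect LD
X = standardize(G)
ell = ld_scores(X)
```

Each score is at least one because the sum includes $r_{jj}^2=1$, and the duplicated pair scores at least two because each variant also tags the other.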

The expected $\chi^2$-statistic of variant $j$ is

\[\bbE[\chi_j^2]\approx \frac{Nh_g^2}{M}\ell_j+1\]

First, since $\bbE[\hat\beta_j]=0$,

\[\bbE[\chi_j^2] = N\bbE[\hat\beta_j^2] = N\left(\Var[\hat\beta_j] + (\bbE[\hat\beta_j])^2\right)=N\cdot\Var[\hat\beta_j]\]

By the law of total variance, and because $\bbE[\hat\beta_j\mid X]=0$,

\[\Var[\hat\beta_j] = \bbE[\Var[\hat\beta_j\mid X]] + \Var[\bbE[\hat\beta_j\mid X]] = \bbE[\Var[\hat\beta_j\mid X]].\]

Conditional on $X$, we have $\Var[\phi\mid X] = \frac{h_g^2}{M}XX^T + (1-h_g^2)I$ and $X_j^TX_j=N$, so

\[\Var[\hat\beta_j\mid X] = \frac{1}{N^2}\Var[X_j^T\phi\mid X] = \frac{1}{N^2}X_j^T\Var[\phi\mid X]X_j = \frac{1}{N^2}\left(\frac{h_g^2}{M}X_j^TXX^TX_j+N(1-h_g^2)\right)\]

Write the quadratic form in the first term as

\[\frac{1}{N^2}X_j^TXX^TX_j = \sum_{k=1}^M\tilde r_{jk}^2\]

where $\tilde r_{jk} = \frac 1N\sum_{i=1}^NX_{ij}X_{ik}$ denotes the sample correlation between additively-coded genotypes at variants $j$ and $k$.

Since

\[\bbE[\tilde r_{jk}^2] \approx r_{jk}^2 + (1-r_{jk}^2)/N\]

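(The paper attributes this approximation to the delta method, dropping terms of order $1/N^2$. Two sanity checks: if $r_{jk}=0$, the squared sample correlation of two independent variants has expectation about $1/N$; if $r_{jk}=1$, then $\tilde r_{jk}=1$ exactly. The formula reproduces both cases.)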

then

\[\bbE\left[\sum_{k=1}^M\tilde r_{jk}^2\right]\approx \ell_j + \frac{M-\ell_j}{N}\]
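A quick numerical check of this bias is possible in the simplest case. The sketch below assumes independent variants, so that $\ell_j=1$ for every $j$ and the right-hand side becomes $1+(M-1)/N$; the sizes and the allele frequency are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 200, 500   # hypothetical sizes; M > N makes the bias easy to see

# Independent variants, so the population LD Score of each one is 1.
G = rng.binomial(2, 0.3, size=(N, M)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Sum of squared sample correlations for every variant j.
sample_ld = ((X.T @ X / N) ** 2).sum(axis=1)

# Approximation from the text with ell_j = 1.
approx = 1.0 + (M - 1.0) / N

print(f"mean sample sum: {sample_ld.mean():.3f}, approximation: {approx:.3f}")
```

Although the true LD Score is 1, the sample-based sum averages roughly $1+(M-1)/N$, which is why the $(M-\ell_j)/N$ term cannot be ignored when $M$ is comparable to $N$.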

Thus,

\[\bbE[\chi_j^2] \approx N\left(\frac{h_g^2}{M}\left[\ell_j + \frac{M-\ell_j}{N}\right] + \frac 1N(1-h_g^2)\right) = h_g^2\left(\frac{N\ell_j+M-\ell_j}{M} - 1\right) + 1 = \frac{N(1-1/N)h_g^2}{M}\ell_j +1 \approx \frac{Nh_g^2}{M}\ell_j +1\]
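The conditional step of the derivation can be checked by simulation. Holding $X$ fixed, the expression for $\Var[\hat\beta_j\mid X]$ gives $\bbE[\chi_j^2\mid X] = \frac{Nh_g^2}{M}\sum_k\tilde r_{jk}^2 + (1-h_g^2)$. The sketch below compares this with an average over many draws of $\beta$ and $\epsilon$. The block-LD genotype generator, the Gaussian draws and all sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M, h2 = 400, 100, 0.5      # hypothetical sizes and heritability
block, reps = 10, 4000        # LD block size and number of replicates

# Genotypes with block LD: inside a block, each entry copies a shared base
# genotype with probability 0.8 and is an independent draw otherwise.
base = np.repeat(rng.binomial(2, 0.3, size=(N, M // block)), block, axis=1)
indep = rng.binomial(2, 0.3, size=(N, M))
G = np.where(rng.random((N, M)) < 0.8, base, indep).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Sum of squared sample correlations for each variant.
ell_tilde = ((X.T @ X / N) ** 2).sum(axis=1)

# Many phenotypes for the same X: one column per replicate.
B = rng.normal(0.0, np.sqrt(h2 / M), size=(M, reps))
E = rng.normal(0.0, np.sqrt(1.0 - h2), size=(N, reps))
Phi = X @ B + E

# Chi-square statistics for every variant and replicate, averaged over replicates.
chi2 = N * (X.T @ Phi / N) ** 2
mean_chi2 = chi2.mean(axis=1)

# Conditional expectation implied by the derivation.
predicted = N * h2 / M * ell_tilde + (1.0 - h2)
rel_err = np.abs(mean_chi2 / predicted - 1.0).max()

print(f"largest relative deviation: {rel_err:.3f}")
```

The averaged statistics should track `predicted` up to Monte Carlo noise; replacing the sample sum by its expectation $\ell_j + (M-\ell_j)/N$ then gives the final formula.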
