LD Score Regression

Posted on Dec 15, 2022 (Update: Feb 07, 2023)

Tags: Polygenicity, GWAS, Linkage Disequilibrium

This note is for Bulik-Sullivan, B. K., Loh, P.-R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., Daly, M. J., Price, A. L., & Neale, B. M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47(3), 291–295.

An inflated distribution of test statistics in GWAS can arise from

polygenicity (many small genetic effects)
confounding biases, such as cryptic relatedness and population stratification

The paper proposes LD Score regression, which

quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD)
yields an intercept that can be used to estimate a more powerful and accurate correction factor than genomic control

Under a polygenic model, where effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is

\[E[\chi^2\mid \ell_j] = Nh^2\ell_j/M + Na +1\,,\]

where

$N$: sample size
$M$: number of SNPs, so that $h^2/M$ is the average heritability explained per SNP
$a$: contribution of confounding biases
$\ell_j = \sum_kr_{jk}^2$: LD score of variant $j$, which measures the amount of genetic variation tagged by $j$

Consequently, if we regress the $\chi^2$ statistics from a GWAS against LD Score, the intercept minus one estimates the mean contribution of confounding bias to the inflation in the test statistics.
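The regression described above can be sketched on synthetic data. This is a minimal illustration, not the paper's estimator: it uses plain ordinary least squares (the paper's estimator additionally uses regression weights), and the sample sizes, the gamma-distributed LD Scores, the scaled $\chi^2_1$ noise, and all variable names are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings, chosen only for illustration.
N, M = 50_000, 1_000_000    # sample size, number of SNPs
h2, a = 0.4, 2e-6           # heritability, confounding term (so N*a = 0.1)

# Synthetic LD Scores (assumed gamma-distributed) and chi-square statistics
# whose mean follows E[chi2 | l_j] = N*h2*l_j/M + N*a + 1.
ld = rng.gamma(shape=2.0, scale=50.0, size=M)
mean_chi2 = N * h2 * ld / M + N * a + 1.0
chi2 = mean_chi2 * rng.chisquare(df=1, size=M)   # chi2_1 has mean 1

# Ordinary least squares of chi2 on LD Score.
slope, intercept = np.polyfit(ld, chi2, deg=1)

h2_hat = slope * M / N             # the slope estimates N*h2/M
confounding_hat = intercept - 1.0  # intercept minus one estimates N*a

print(f"h2_hat = {h2_hat:.3f}, intercept - 1 = {confounding_hat:.3f}")
```

With these settings the slope recovers a value close to `h2` and the intercept minus one a value close to `N * a`, up to sampling noise.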

Model

Model phenotypes as

\[\phi = X\beta + \epsilon\]

where

$\phi$: $N\times 1$ vector of (quantitative) phenotypes
$X$: an $N\times M$ matrix of genotypes normalized to mean zero and variance one ($\sum_{i=1}^NX_{ij} =0, \frac 1N X_j^TX_j=1$)
$\beta$: per-normalized-genotype effect sizes
$\bbE[\epsilon]=0, \Var[\epsilon]=(1-h_g^2)I, \bbE[\beta]=0, \Var[\beta] = (h_g^2/M)I$
assume genotype at variant $j$ for individual $i$ is independent of other individuals’ genotypes
incorporate linkage disequilibrium: define $r_{jk} = \bbE[X_{ij}X_{ik}]$, which does not depend on $i$
assume $X, \beta$ and $\epsilon$ are mutually independent
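The assumptions listed above can be turned into a small simulation. The sketch below draws one data set under the model; the binomial genotypes with allele frequency 0.3, the Gaussian choices for $\beta$ and $\epsilon$, and the sizes are all assumptions for illustration (the model itself only fixes means and variances).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, h2_g = 2_000, 500, 0.5   # hypothetical sizes and heritability

# Hypothetical genotypes: independent SNPs coded 0/1/2, allele frequency 0.3.
G = rng.binomial(2, 0.3, size=(N, M)).astype(float)

# Normalize each column to mean zero and variance one (1/N convention),
# so that sum_i X_ij = 0 and (1/N) X_j^T X_j = 1.
X = (G - G.mean(axis=0)) / G.std(axis=0)

# Effect sizes and noise with the variances stated in the model
# (normality is an extra assumption made here).
beta = rng.normal(0.0, np.sqrt(h2_g / M), size=M)    # Var[beta] = (h2_g/M) I
eps = rng.normal(0.0, np.sqrt(1.0 - h2_g), size=N)   # Var[eps] = (1 - h2_g) I

phi = X @ beta + eps   # phenotype vector; its variance is close to 1
```

Because $\Var[\beta]$ and $\Var[\epsilon]$ sum to $h_g^2 + (1-h_g^2) = 1$ per individual, the simulated phenotype has variance near one.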

For each variant $j=1,\ldots,M$, compute least-squares estimates of effect size

\[\hat\beta_j=(X_j^TX_j)^{-1}(X_j^T\phi)=(X_j^T\phi)/N\]

and $\chi_j^2$-statistics $\chi_j^2=N\hat\beta_j^2$.
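As a minimal sketch, the marginal estimates and statistics can be computed for all variants at once. The function name `marginal_chi2` and the toy data are hypothetical; the code assumes the columns of $X$ are already normalized as in the model.

```python
import numpy as np

def marginal_chi2(X, phi):
    """Return per-variant least-squares estimates and chi-square statistics.

    Assumes each column of X has mean zero and (1/N) * X_j^T X_j = 1,
    so that (X_j^T X_j)^{-1} reduces to 1/N.
    """
    N = X.shape[0]
    beta_hat = X.T @ phi / N    # beta_hat_j = X_j^T phi / N
    chi2 = N * beta_hat**2      # chi2_j = N * beta_hat_j^2
    return beta_hat, chi2

# Toy data (hypothetical): 4 individuals, 2 normalized variants.
X = np.array([[ 1.0,  1.0],
              [ 1.0, -1.0],
              [-1.0,  1.0],
              [-1.0, -1.0]])
phi = np.array([2.0, 0.0, 0.0, -2.0])

beta_hat, chi2 = marginal_chi2(X, phi)
print(beta_hat)  # [1. 1.]
print(chi2)      # [4. 4.]
```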

Define the LD Score of variant $j$ as

\[\ell_j = \sum_{k=1}^M r_{jk}^2\,.\]
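To make the definition concrete, the sketch below computes LD Scores from a normalized genotype matrix, using sample correlations in place of the population $r_{jk}$. The function name `ld_scores` and the toy matrix are hypothetical, and summing over all $M$ variants is only practical for small examples.

```python
import numpy as np

def ld_scores(X):
    """LD Score of each variant: the row sums of the squared correlation matrix.

    Assumes columns of X have mean zero and variance one, so X^T X / N is
    the sample correlation matrix. The sum includes k = j, where r_jj = 1.
    """
    N = X.shape[0]
    R = X.T @ X / N
    return (R**2).sum(axis=1)

# Toy example (hypothetical): variants 1 and 2 are identical (r = 1),
# variant 3 is uncorrelated with both.
c1 = np.array([1.0, 1.0, -1.0, -1.0])
c3 = np.array([1.0, -1.0, 1.0, -1.0])
X = np.column_stack([c1, c1, c3])

scores = ld_scores(X)
print(scores)  # [2. 2. 1.]
```

The two perfectly correlated variants each tag two variants' worth of variation, while the isolated variant tags only itself.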

Under this model, which has no confounding term ($a=0$), the expected $\chi^2$-statistic of variant $j$ is
\[\bbE[\chi_j^2]\approx \frac{Nh_g^2}{M}\ell_j+1\]

First, since $\bbE[\hat\beta_j]=0$ (because $\bbE[\beta]=0$ and $\bbE[\epsilon]=0$, independently of $X$),

\[\bbE[\chi_j^2] = N\bbE[\hat\beta_j^2] = N\left(\Var[\hat\beta_j] + (\bbE[\hat\beta_j])^2\right)=N\cdot\Var[\hat\beta_j]\]

By the law of total variance, and because $\bbE[\hat\beta_j\mid X] = \frac 1N X_j^TX\,\bbE[\beta] = 0$,

\[\Var[\hat\beta_j] = \bbE[\Var[\hat\beta_j\mid X]] + \Var[\bbE[\hat\beta_j\mid X]] = \bbE[\Var[\hat\beta_j\mid X]].\]

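Note that $\Var[\phi\mid X] = X\Var[\beta]X^T + \Var[\epsilon] = \frac{h_g^2}{M}XX^T + (1-h_g^2)I$ and $X_j^TX_j = N$, so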

\[\Var[\hat\beta_j\mid X] = \frac{1}{N^2}\Var[X_j^T\phi\mid X] = \frac{1}{N^2}X_j^T\Var[\phi\mid X]X_j = \frac{1}{N^2}\left(\frac{h_g^2}{M}X_j^TXX^TX_j+N(1-h_g^2)\right)\]

Write the quadratic form in the first term as

\[\frac{1}{N^2}X_j^TXX^TX_j = \sum_{k=1}^M\left(\frac{X_j^TX_k}{N}\right)^2 = \sum_{k=1}^M\tilde r_{jk}^2\]

where $\tilde r_{jk} = \frac 1N\sum_{i=1}^NX_{ij}X_{ik}$ denotes the sample correlation between additively-coded genotypes at variants $j$ and $k$.

Since

\[\bbE[\tilde r_{jk}^2] \approx r_{jk}^2 + (1-r_{jk}^2)/N\]

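(The paper obtains this approximation via the delta method, with the $\approx$ hiding terms of order $O(1/N^2)$. The extra $(1-r_{jk}^2)/N$ is the same first-order upward bias of a squared sample correlation that the adjusted $R^2$ corrects for.)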

then

\[\bbE\left[\sum_{k=1}^M\tilde r_{jk}^2\right]\approx \ell_j + \frac{M-\ell_j}{N}\]
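The approximation for $\bbE[\tilde r_{jk}^2]$ used above can be checked numerically in the simplest case, $r_{jk}=0$, where it predicts $\bbE[\tilde r_{jk}^2]\approx 1/N$. The following Monte Carlo sketch assumes Gaussian "genotypes" and per-replicate sample normalization; both are simplifications for illustration, and it does not verify the formula for $r_{jk}\neq 0$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 100, 20_000   # hypothetical sample size and number of replicates

# Two independent variants (true r_jk = 0), redrawn in every replicate.
x = rng.normal(size=(reps, N))
y = rng.normal(size=(reps, N))

# Normalize within each replicate to mean zero and variance one.
x = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
y = (y - y.mean(axis=1, keepdims=True)) / y.std(axis=1, keepdims=True)

r_tilde = (x * y).mean(axis=1)   # sample correlation in each replicate
mean_r2 = (r_tilde**2).mean()    # Monte Carlo estimate of E[r_tilde^2]

r2_true = 0.0
predicted = r2_true + (1.0 - r2_true) / N   # r^2 + (1 - r^2)/N

print(f"simulated {mean_r2:.4f} vs predicted {predicted:.4f}")
```

Even though the two variants are truly uncorrelated, the squared sample correlation averages about $1/N$ rather than zero, which is why the sample LD Score overshoots $\ell_j$ by roughly $(M-\ell_j)/N$.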

Thus,

\[\bbE[\chi_j^2] \approx N\left(\frac{h_g^2}{M}\left[\ell_j + \frac{M-\ell_j}{N}\right] + \frac 1N(1-h_g^2)\right) = h_g^2\left(\frac{N\ell_j+M-\ell_j}{M} - 1\right) + 1 = \frac{N(1-1/N)h_g^2}{M}\ell_j +1 \approx \frac{Nh_g^2}{M}\ell_j +1\]
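The final result can be sanity-checked by simulation in the special case of independent variants, where $\ell_j = 1$ for every $j$. This is only a sketch under stated assumptions: Gaussian genotypes, Gaussian $\beta$ and $\epsilon$, one fixed genotype matrix, and arbitrary sizes.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, h2_g, reps = 500, 200, 0.5, 500   # hypothetical settings

# Independent variants, so the true LD Score is l_j = 1 for every j.
Z = rng.normal(size=(N, M))
X = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # mean zero, variance one per column

# Draw beta and epsilon repeatedly; one column per replicate.
beta = rng.normal(0.0, np.sqrt(h2_g / M), size=(M, reps))
eps = rng.normal(0.0, np.sqrt(1.0 - h2_g), size=(N, reps))
phi = X @ beta + eps                       # shape (N, reps)

beta_hat = X.T @ phi / N                   # shape (M, reps)
chi2 = N * beta_hat**2

observed = chi2.mean()                     # average over variants and replicates
predicted = N * h2_g / M * 1.0 + 1.0       # N*h_g^2*l_j/M + 1 with l_j = 1

print(f"observed {observed:.3f} vs predicted {predicted:.3f}")
```

The observed mean $\chi^2$ should sit close to the predicted value of 2.25 for these settings, up to Monte Carlo error and the dropped $1/N$ factor.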

Published in categories Note


WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.
