LD Score Regression

Posted on Dec 15, 2022 (Update: Feb 07, 2023)

An inflated distribution of test statistics in GWAS can arise from

• polygenicity (many small genetic effects)
• confounding biases, such as cryptic relatedness and population stratification

The paper proposes LD Score regression, which

• quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD)
• yields an intercept that can be used to estimate a more powerful and accurate correction factor than genomic control

Under a polygenic model, where effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is

$E[\chi^2\mid \ell_j] = Nh^2\ell_j/M + Na +1\,,$

where

• $N$: sample size
• $M$: number of SNPs, so that $h^2/M$ is the average heritability explained per SNP
• $a$: contribution of confounding biases
• $\ell_j = \sum_kr_{jk}^2$: LD score of variant $j$, which measures the amount of genetic variation tagged by $j$

Consequently, if we regress the $\chi^2$ statistics from a GWAS against LD Score, the intercept minus one is an estimator of the mean contribution of confounding bias to the inflation in the test statistics.
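The regression described above can be sketched in a few lines. This is only a toy illustration: the values of `N`, `M`, `h2` and `a` are hypothetical, the $\chi^2$ values are the noise-free expectations from the formula rather than real GWAS statistics, and the fit is plain ordinary least squares (the actual LD Score regression uses a weighted regression). The helper `ols` is written here for the example and is not part of any library.

```python
import random

def ols(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical values, chosen only for illustration.
N, M = 50_000, 1_000_000   # sample size, number of SNPs
h2, a = 0.4, 2e-6          # heritability, confounding contribution

rng = random.Random(0)
ld_scores = [rng.uniform(1.0, 200.0) for _ in range(2000)]

# Expected chi-square of each variant: N * h2 * ell / M + N * a + 1.
# Real statistics would scatter around this line.
expected_chi2 = [N * h2 * ell / M + N * a + 1 for ell in ld_scores]

intercept, slope = ols(ld_scores, expected_chi2)
h2_hat = slope * M / N        # slope = N * h2 / M
a_hat = (intercept - 1) / N   # intercept = N * a + 1

print(f"intercept = {intercept:.4f}, h2_hat = {h2_hat:.4f}, a_hat = {a_hat:.2e}")
```

The slope carries the polygenic signal and the intercept carries the confounding, which is why the two sources of inflation can be separated.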

Model

Model phenotypes as

$\phi = X\beta + \epsilon$

where

• $\phi$: $N\times 1$ vector of (quantitative) phenotypes
• $X$: an $N\times M$ matrix of genotypes normalized to mean zero and variance one ($\sum_{i=1}^NX_{ij} =0, \frac 1N X_j^TX_j=1$)
• $\beta$: per-normalized-genotype effect sizes
• $\bbE[\epsilon]=0, \Var[\epsilon]=(1-h_g^2)I, \bbE[\beta]=0, \Var[\beta] = (h_g^2/M)I$
• assume genotype at variant $j$ for individual $i$ is independent of other individuals' genotypes
• incorporate linkage disequilibrium: define $r_{jk} = \bbE[X_{ij}X_{ik}]$, which does not depend on $i$
• assume $X, \beta$ and $\epsilon$ are mutually independent

For each variant $j=1,\ldots,M$, compute least-squares estimates of effect size

$\hat\beta_j=(X_j^TX_j)^{-1}(X_j^T\phi)=(X_j^T\phi)/N$

and $\chi_j^2$-statistics $\chi_j^2=N\hat\beta_j^2$.
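The model and the per-variant statistics can be simulated directly. The sketch below makes several assumptions that are not in the text: the sizes `N`, `M` and the heritability `h2` are arbitrary, SNPs are drawn independently (no LD), and both $\beta$ and $\epsilon$ are taken to be Gaussian. The helper `standardized_snp` is hypothetical.

```python
import random

rng = random.Random(1)
N, M, h2 = 500, 40, 0.5   # hypothetical sizes and heritability

def standardized_snp(n, maf):
    """Draw n genotypes in {0, 1, 2} and scale to sample mean 0, variance 1."""
    while True:
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        mean = sum(g) / n
        var = sum((x - mean) ** 2 for x in g) / n
        if var > 0:                      # skip monomorphic draws
            return [(x - mean) / var ** 0.5 for x in g]

# Columns of X, stored as M lists of length N (independent SNPs: an assumption).
X = [standardized_snp(N, rng.uniform(0.05, 0.5)) for _ in range(M)]

# beta_j with variance h2 / M and epsilon_i with variance 1 - h2.
beta = [rng.gauss(0.0, (h2 / M) ** 0.5) for _ in range(M)]
phi = [sum(X[j][i] * beta[j] for j in range(M)) + rng.gauss(0.0, (1 - h2) ** 0.5)
       for i in range(N)]

# Marginal least-squares estimates X_j^T phi / N and chi-square statistics.
beta_hat = [sum(x * y for x, y in zip(X[j], phi)) / N for j in range(M)]
chi2 = [N * b ** 2 for b in beta_hat]

print(f"mean chi-square over {M} SNPs: {sum(chi2) / M:.2f}")
```

Because each column satisfies $X_j^TX_j=N$ by construction, dividing by `N` is the same as applying $(X_j^TX_j)^{-1}$. A single run is noisy, so the printed mean should not be read as a precise estimate.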

Define the LD Score of variant $j$ as

$\ell_j = \sum_{k=1}^M r_{jk}^2\,.$

The expected $\chi^2$-statistic of variant $j$ is

$\bbE[\chi_j^2]\approx \frac{Nh_g^2}{M}\ell_j+1$

First, because $\bbE[\beta]=0$ and $\bbE[\epsilon]=0$ imply $\bbE[\hat\beta_j]=0$,

$\bbE[\chi_j^2] = N\bbE[\hat\beta_j^2] = N\left(\Var[\hat\beta_j] + (\bbE[\hat\beta_j])^2\right)=N\cdot\Var[\hat\beta_j]$

By the law of total variance, and because $\bbE[\hat\beta_j\mid X]=0$,

$\Var[\hat\beta_j] = \bbE[\Var[\hat\beta_j\mid X]] + \Var[\bbE[\hat\beta_j\mid X]] = \bbE[\Var[\hat\beta_j\mid X]].$

Note that, using $\Var[\phi\mid X] = \frac{h_g^2}{M}XX^T + (1-h_g^2)I$ and $X_j^TX_j=N$,

$\Var[\hat\beta_j\mid X] = \frac{1}{N^2}\Var[X_j^T\phi\mid X] = \frac{1}{N^2}X_j^T\Var[\phi\mid X]X_j = \frac{1}{N^2}\left(\frac{h_g^2}{M}X_j^TXX^TX_j+N(1-h_g^2)\right)$

Rescaled by $1/N^2$, the quadratic form in the first term can be written as

$\frac{1}{N^2}X_j^TXX^TX_j = \sum_{k=1}^M\tilde r_{jk}^2$

where $\tilde r_{jk} = \frac 1N\sum_{i=1}^NX_{ij}X_{ik}$ denotes the sample correlation between additively-coded genotypes at variants $j$ and $k$.
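The identity between the quadratic form and the sum of squared sample correlations can be checked numerically on a toy genotype matrix. Everything about the simulation is an assumption made for illustration: the sizes, the allele frequency, and the crude way LD is induced (each SNP copies the previous one and re-draws about 30% of individuals).

```python
import random

rng = random.Random(2)
N, M = 200, 15   # hypothetical sizes

def draw_genotype(maf):
    """One genotype in {0, 1, 2} as the sum of two Bernoulli draws."""
    return (rng.random() < maf) + (rng.random() < maf)

def standardize(g):
    """Scale a genotype column to sample mean 0 and variance 1."""
    n = len(g)
    mean = sum(g) / n
    sd = (sum((x - mean) ** 2 for x in g) / n) ** 0.5
    return [(x - mean) / sd for x in g]

# Toy LD: each SNP keeps the previous SNP's genotype with probability 0.7.
raw = [[draw_genotype(0.3) for _ in range(N)]]
for _ in range(M - 1):
    prev = raw[-1]
    raw.append([g if rng.random() < 0.7 else draw_genotype(0.3) for g in prev])
cols = [standardize(g) for g in raw]   # columns X_1, ..., X_M

# A = X X^T, the N x N matrix that appears in Var[phi | X].
A = [[sum(c[i] * c[l] for c in cols) for l in range(N)] for i in range(N)]

def quad_form(j):
    """(1 / N^2) * X_j^T (X X^T) X_j."""
    xj = cols[j]
    return sum(xj[i] * A[i][l] * xj[l] for i in range(N) for l in range(N)) / N ** 2

def sample_ld_score(j):
    """Sum over k of the squared sample correlation between SNPs j and k."""
    return sum((sum(u * v for u, v in zip(cols[j], cols[k])) / N) ** 2
               for k in range(M))

max_gap = max(abs(quad_form(j) - sample_ld_score(j)) for j in range(M))
print(f"largest discrepancy over all SNPs: {max_gap:.2e}")
```

The two quantities are computed along different routes (through the $N\times N$ matrix versus through pairwise correlations), so agreement up to floating-point error is a genuine check of the algebra.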

Since

$\bbE[\tilde r_{jk}^2] \approx r_{jk}^2 + (1-r_{jk}^2)/N$

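(This is the usual upward bias of a squared sample correlation. When $r_{jk}=0$, $\tilde r_{jk}$ is an average of $N$ approximately independent products with mean zero and variance one, so $\bbE[\tilde r_{jk}^2]\approx 1/N$; when $r_{jk}=\pm 1$, $\tilde r_{jk}^2=1$ exactly and the bias vanishes. The term $(1-r_{jk}^2)/N$ interpolates between these two cases.)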

then

$\bbE\left[\sum_{k=1}^M\tilde r_{jk}^2\right]\approx \ell_j + \frac{M-\ell_j}{N}$
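The bias approximation for $\tilde r_{jk}^2$ can also be probed by simulation. The sketch below uses correlated Gaussian pairs as a stand-in for genotypes, which is an assumption, and the function name, sample size and repetition count are arbitrary. Since the formula is only an approximation, the simulated and predicted values should be expected to agree roughly rather than exactly.

```python
import random

def mean_squared_sample_corr(r, n, reps, rng):
    """Average squared sample correlation over `reps` datasets of size n."""
    total = 0.0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            x = rng.gauss(0.0, 1.0)
            # y has unit variance and population correlation r with x.
            y = r * x + (1 - r * r) ** 0.5 * rng.gauss(0.0, 1.0)
            xs.append(x)
            ys.append(y)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        total += sxy * sxy / (sxx * syy)
    return total / reps

rng = random.Random(3)
N = 50   # deliberately small so the bias is visible
results = {}
for r in (0.0, 0.3):
    simulated = mean_squared_sample_corr(r, N, 4000, rng)
    predicted = r * r + (1 - r * r) / N
    results[r] = (simulated, predicted)
    print(f"r = {r}: simulated {simulated:.4f}, predicted {predicted:.4f}")
```

Standardizing each sample to mean zero and variance one, as the model does for genotypes, makes $\tilde r_{jk}$ exactly the Pearson correlation computed here.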

Thus,

$\bbE[\chi_j^2] \approx N\left(\frac{h_g^2}{M}\left[\ell_j + \frac{M-\ell_j}{N}\right] + \frac 1N(1-h_g^2)\right) = h_g^2\left(\frac{N\ell_j+M-\ell_j}{M} - 1\right) + 1 = \frac{N(1-1/N)h_g^2}{M}\ell_j +1 \approx \frac{Nh_g^2}{M}\ell_j +1$
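
To get a feel for the scale, take hypothetical values $N=50{,}000$, $M=10^6$, $h_g^2=0.4$ and $\ell_j=100$ (chosen only for illustration, not taken from the paper). The final approximation gives

$\bbE[\chi_j^2]\approx \frac{50{,}000\times 0.4}{10^6}\times 100 + 1 = 3\,,$

while keeping the factor $(1-1/N)$ gives $2.99996$, so the last approximation step is harmless at typical GWAS sample sizes.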
