LD Score Regression
An inflated distribution of test statistics in GWAS can arise from
- polygenicity (many small genetic effects)
- confounding biases, such as cryptic relatedness and population stratification
The paper proposes LD Score regression, which
- quantifies the contribution of each by examining the relationship between test statistics and linkage disequilibrium (LD)
- provides an intercept that can be used to estimate a more powerful and accurate correction factor than genomic control
Under a polygenic model, where effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is
\[E[\chi^2\mid \ell_j] = Nh^2\ell_j/M + Na +1\,,\]where
- $N$: sample size
- $M$: number of SNPs, so that $h^2/M$ is the average heritability explained per SNP
- $a$: contribution of confounding biases
- $\ell_j = \sum_kr_{jk}^2$: LD score of variant $j$, which measures the amount of genetic variation tagged by $j$
Consequently, if we regress the $\chi^2$ statistics from GWAS against LD Score, the intercept minus one is an estimator of the mean contribution of confounding bias to the inflation in the test statistics.
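As a minimal sketch of this regression (not the paper's implementation), the snippet below simulates $\chi^2$ statistics whose expectation follows the formula above and then recovers the intercept and $h^2$. It uses plain unweighted least squares, which is a simplification. All numerical values (`N`, `M`, `h2`, `a`, the range of LD Scores) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical values, chosen only for illustration
N, M = 10_000, 100_000      # sample size, number of SNPs
h2, a = 0.5, 1e-5           # heritability, confounding term (so N * a = 0.1)

ell = rng.uniform(1, 200, size=M)          # made-up LD Scores
expected = N * h2 * ell / M + N * a + 1    # E[chi^2 | ell_j]

# chi^2_j = z_j^2 with z_j ~ N(0, expected_j), so E[chi^2_j] = expected_j
chi2 = expected * rng.chisquare(1, size=M)

# Unweighted least squares of chi^2 on LD Score (a simplification)
slope, intercept = np.polyfit(ell, chi2, 1)

h2_hat = slope * M / N              # slope estimates N * h2 / M
confounding_hat = intercept - 1     # intercept minus one estimates N * a

print(f"h2_hat = {h2_hat:.3f}, intercept - 1 = {confounding_hat:.3f}")
```

The slope carries the polygenic signal and the intercept carries the confounding term, which is why the two sources of inflation can be separated.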
Model
Model phenotypes as
\[\phi = X\beta + \epsilon\]where
- $\phi$: $N\times 1$ vector of (quantitative) phenotypes
- $X$: an $N\times M$ matrix of genotypes normalized to mean zero and variance one ($\sum_{i=1}^NX_{ij} =0, \frac 1N X_j^TX_j=1$)
- $\beta$: per-normalized-genotype effect sizes
- $\bbE[\epsilon]=0, \Var[\epsilon]=(1-h_g^2)I, \bbE[\beta]=0, \Var[\beta] = (h_g^2/M)I$
- assume genotype at variant $j$ for individual $i$ is independent of other individuals’ genotypes
- incorporate linkage disequilibrium: define $r_{jk} = \bbE[X_{ij}X_{ik}]$, which does not depend on $i$
- assume $X, \beta$ and $\epsilon$ are mutually independent
For each variant $j=1,\ldots,M$, compute least-squares estimates of effect size
\[\hat\beta_j=(X_j^TX_j)^{-1}(X_j^T\phi)=(X_j^T\phi)/N\]and $\chi_j^2$-statistics $\chi_j^2=N\hat\beta_j^2$.
Define the LD Score of variant $j$ as
\[\ell_j = \sum_{k=1}^M r_{jk}^2\,.\]The expected $\chi^2$-statistic of variant $j$ is
\[\bbE[\chi_j^2]\approx \frac{Nh_g^2}{M}\ell_j+1\]
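The relation $\bbE[\chi_j^2]\approx Nh_g^2\ell_j/M+1$ can be checked with a small simulation of the model. The sketch below rests on several assumptions that are not in the notes: genotypes are drawn as Gaussians rather than 0/1/2 allele counts, LD is an equicorrelated block structure, and every parameter value is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settings, for illustration only
N, M, h2 = 500, 200, 0.5
block, rho = 10, 0.9      # LD blocks of 10 SNPs with pairwise correlation 0.9
n_ld = M // 2             # the first half of the SNPs sit in LD blocks

# Population LD Scores: 1 for independent SNPs, 1 + 9 * rho^2 inside a block
ell = np.ones(M)
ell[:n_ld] = 1 + (block - 1) * rho**2

def simulate_chi2():
    """Draw X, beta, epsilon once and return the M chi-square statistics."""
    X = rng.standard_normal((N, M))
    shared = rng.standard_normal((N, n_ld // block))
    # A shared component per block induces correlation rho within the block
    X[:, :n_ld] = (np.sqrt(rho) * np.repeat(shared, block, axis=1)
                   + np.sqrt(1 - rho) * X[:, :n_ld])
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, (1/N) X_j^T X_j = 1
    beta = rng.normal(0, np.sqrt(h2 / M), size=M)
    eps = rng.normal(0, np.sqrt(1 - h2), size=N)
    phi = X @ beta + eps
    beta_hat = X.T @ phi / N                   # marginal least-squares estimates
    return N * beta_hat**2

# Average the statistics over independent replicates
chi2_mean = np.mean([simulate_chi2() for _ in range(300)], axis=0)
predicted = N * h2 * ell / M + 1

print("LD block SNPs:    ", chi2_mean[:n_ld].mean(), "vs", predicted[:n_ld].mean())
print("independent SNPs: ", chi2_mean[n_ld:].mean(), "vs", predicted[n_ld:].mean())
```

Averaging within each class of SNPs keeps the Monte Carlo noise manageable; individual $\chi_j^2$ values are far too noisy to compare one by one.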
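Firstly, since $\bbE[\hat\beta_j]=0$ (because $\bbE[\beta]=0$, $\bbE[\epsilon]=0$, and $\beta$, $\epsilon$ are independent of $X$),
\[\bbE[\chi_j^2] = N\bbE[\hat\beta_j^2] = N\left(\Var[\hat\beta_j] + (\bbE[\hat\beta_j])^2\right)=N\cdot\Var[\hat\beta_j]\]By the law of total variance,
\[\Var[\hat\beta_j] = \bbE[\Var[\hat\beta_j\mid X]] + \Var[\bbE[\hat\beta_j\mid X]] = \bbE[\Var[\hat\beta_j\mid X]],\]where the second term vanishes because $\bbE[\hat\beta_j\mid X]=\frac 1N X_j^TX\,\bbE[\beta]=0$. Note that,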
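\[\Var[\hat\beta_j\mid X] = \frac{1}{N^2}\Var[X_j^T\phi\mid X] = \frac{1}{N^2}X_j^T\Var[\phi\mid X]X_j = \frac{1}{N^2}\left(\frac{h_g^2}{M}X_j^TXX^TX_j+N(1-h_g^2)\right)\]using $\Var[\phi\mid X] = \frac{h_g^2}{M}XX^T+(1-h_g^2)I$ and $X_j^TX_j=N$. Write the first term, without its factor $h_g^2/M$, as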
\[\frac{1}{N^2}X_j^TXX^TX_j = \sum_{k=1}^M\tilde r_{jk}^2\]where $\tilde r_{jk} = \frac 1N\sum_{i=1}^NX_{ij}X_{ik}$ denotes the sample correlation between additively-coded genotypes at variants $j$ and $k$.
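The identity above is purely algebraic, so it can be confirmed numerically on any normalized matrix. A quick sketch, with arbitrary sizes and a random Gaussian matrix standing in for genotypes:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, j = 50, 8, 3    # arbitrary small sizes and variant index

X = rng.standard_normal((N, M))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean 0, (1/N) X_j^T X_j = 1

# Left-hand side: (1/N^2) X_j^T X X^T X_j
lhs = X[:, j] @ X @ X.T @ X[:, j] / N**2

# Right-hand side: sum over k of squared sample correlations with variant j
r_tilde = X.T @ X[:, j] / N
rhs = np.sum(r_tilde**2)

print(np.isclose(lhs, rhs))
```

The key observation is that $X^TX_j$ is the vector with entries $N\tilde r_{jk}$, so the quadratic form is just its squared norm.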
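Since
\[\bbE[\tilde r_{jk}^2] \approx r_{jk}^2 + (1-r_{jk}^2)/N\](this step is not derived here; as a sanity check, it is consistent with both extremes: if variants $j$ and $k$ are independent, then $r_{jk}=0$ and $\bbE[\tilde r_{jk}^2] = \frac{1}{N^2}\sum_{i}\bbE[X_{ij}^2X_{ik}^2] \approx 1/N$, while for $k=j$ we have $\tilde r_{jj}=r_{jj}=1$)
then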
\[\bbE\left[\sum_{k=1}^M\tilde r_{jk}^2\right]\approx \ell_j + \frac{M-\ell_j}{N}\]Thus,
\[\bbE[\chi_j^2] \approx N\left(\frac{h_g^2}{M}\left[\ell_j + \frac{M-\ell_j}{N}\right] + \frac 1N(1-h_g^2)\right) = h_g^2\left(\frac{N\ell_j+M-\ell_j}{M} - 1\right) + 1 = \frac{N(1-1/N)h_g^2}{M}\ell_j +1 \approx \frac{Nh_g^2}{M}\ell_j +1\]
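To get a feel for the size of the last approximation, take hypothetical values $N=10^4$, $M=10^5$, $h_g^2=0.5$ and $\ell_j=100$ (chosen only for illustration):
\[\frac{N(1-1/N)h_g^2}{M}\ell_j + 1 = \frac{9999\times 0.5\times 100}{10^5} + 1 = 5.9995, \qquad \frac{Nh_g^2}{M}\ell_j + 1 = \frac{10^4\times 0.5\times 100}{10^5} + 1 = 6\,.\]The two differ by $h_g^2\ell_j/M = 5\times 10^{-4}$, which is negligible next to the sampling noise of a $\chi^2$ statistic.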