# LD Score Regression

An inflated distribution of test statistics in GWAS can arise from

- polygenicity (many small genetic effects)
- confounding biases, such as cryptic relatedness and population stratification

The paper proposes LD Score regression:

- it quantifies the contribution of each source by examining the relationship between test statistics and linkage disequilibrium (LD)
- its intercept can be used to estimate a correction factor that is more powerful and accurate than genomic control

Under a polygenic model, where effect sizes for variants are drawn independently from distributions with variance proportional to $1/(p(1-p))$, where $p$ is the minor allele frequency (MAF), the expected $\chi^2$ statistic of variant $j$ is

\[E[\chi^2\mid \ell_j] = Nh^2\ell_j/M + Na +1\,,\]where

- $N$: sample size
- $M$: number of SNPs, so that $h^2/M$ is the average heritability explained per SNP
- $a$: contribution of confounding biases
- $\ell_j = \sum_kr_{jk}^2$: LD score of variant $j$, which measures the amount of genetic variation tagged by $j$

Consequently, if we regress the $\chi^2$ statistics from a GWAS against LD Score, the intercept minus one is an estimator of the mean contribution of confounding bias to the inflation in the test statistics.
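As a rough illustration of this regression, the sketch below simulates $\chi^2$ statistics whose mean follows the formula above and fits an ordinary least-squares line to them. The helper name `ldsc_fit`, the synthetic LD Scores, and all parameter values are assumptions made for the example. The unweighted fit is a simplification chosen for brevity and should not be read as the paper's estimator.

```python
import numpy as np

def ldsc_fit(chi2, ld_scores, n_samples, n_snps):
    """Unweighted least-squares fit of chi2 on LD Score (hypothetical helper)."""
    design = np.column_stack([np.ones_like(ld_scores), ld_scores])
    coef, *_ = np.linalg.lstsq(design, chi2, rcond=None)
    intercept, slope = coef
    h2 = slope * n_snps / n_samples  # slope = N h^2 / M
    return intercept, slope, h2

# Synthetic inputs (assumed values, not taken from any real GWAS).
rng = np.random.default_rng(0)
N, M = 50_000, 200_000
h2_true, a_true = 0.5, 2e-6                      # N * a = 0.1
ld_scores = rng.uniform(1.0, 200.0, size=M)

# Mean chi2 per SNP: N h^2 l_j / M + N a + 1, with scaled chi-square(1) noise.
expected = N * h2_true * ld_scores / M + N * a_true + 1.0
chi2 = expected * rng.chisquare(df=1, size=M)

intercept_hat, slope_hat, h2_hat = ldsc_fit(chi2, ld_scores, N, M)
print("intercept - 1 (confounding, N*a):", intercept_hat - 1.0)
print("heritability estimate:", h2_hat)
```

Here the slope carries the polygenic signal ($Nh^2/M$) and the intercept carries $Na + 1$, which is why the two sources of inflation can be separated.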

## Model

Model phenotypes as

\[\phi = X\beta + \epsilon\]where

- $\phi$: $N\times 1$ vector of (quantitative) phenotypes
- $X$: an $N\times M$ matrix of genotypes normalized to mean zero and variance one ($\sum_{i=1}^NX_{ij} =0, \frac 1N X_j^TX_j=1$)
- $\beta$: per-normalized-genotype effect sizes
- $\bbE[\epsilon]=0, \Var[\epsilon]=(1-h_g^2)I, \bbE[\beta]=0, \Var[\beta] = (h_g^2/M)I$
- assume genotype at variant $j$ for individual $i$ is independent of other individuals' genotypes
- incorporate linkage disequilibrium: define $r_{jk} = \bbE[X_{ij}X_{ik}]$, which does not depend on $i$
- assume $X, \beta$ and $\epsilon$ are mutually independent

For each variant $j=1,\ldots,M$, compute least-squares estimates of effect size

\[\hat\beta_j=(X_j^TX_j)^{-1}(X_j^T\phi)=(X_j^T\phi)/N\]and $\chi_j^2$-statistics $\chi_j^2=N\hat\beta_j^2$.
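The per-variant estimates above can be sketched in a few lines of NumPy. This is a minimal example under assumed settings: the function names `standardize` and `marginal_stats`, the binomial genotype simulation, and the values of `N`, `M`, and `h2` are all illustrative and not from the paper.

```python
import numpy as np

def standardize(genotypes):
    """Scale each column to mean 0 and (1/N) * X_j^T X_j = 1."""
    g = np.asarray(genotypes, dtype=float)
    centered = g - g.mean(axis=0)
    # std uses ddof=0 by default, matching the 1/N convention in the text.
    return centered / centered.std(axis=0)

def marginal_stats(X, phi):
    """Per-variant least-squares estimates and chi-square statistics."""
    n = X.shape[0]
    beta_hat = X.T @ phi / n       # (X_j^T phi) / N for every j at once
    chi2 = n * beta_hat ** 2
    return beta_hat, chi2

# Assumed toy data: independent SNPs with allele counts in {0, 1, 2}.
rng = np.random.default_rng(1)
N, M, h2 = 500, 50, 0.4
maf = rng.uniform(0.1, 0.5, size=M)
X = standardize(rng.binomial(2, maf, size=(N, M)))

# Phenotype drawn from the model: phi = X beta + epsilon.
beta = rng.normal(0.0, np.sqrt(h2 / M), size=M)
phi = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=N)

beta_hat, chi2 = marginal_stats(X, phi)
print(beta_hat[:3], chi2[:3])
```

Because every column satisfies $X_j^TX_j = N$, the single matrix product `X.T @ phi / n` reproduces all $M$ univariate regressions without fitting them one by one.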

Define the LD Score of variant $j$ as

\[\ell_j = \sum_{k=1}^M r_{jk}^2\,.\]The expected $\chi^2$-statistic of variant $j$ is

\[\bbE[\chi_j^2]\approx \frac{Nh_g^2}{M}\ell_j+1\]

Firstly,

\[\bbE[\chi_j^2] = N\bbE[\hat\beta_j^2] = N\left(\Var[\hat\beta_j] + (\bbE[\hat\beta_j])^2\right)=N\cdot\Var[\hat\beta_j]\]By the law of total variance,

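\[\Var[\hat\beta_j] = \bbE[\Var[\hat\beta_j\mid X]] + \Var[\bbE[\hat\beta_j\mid X]] = \bbE[\Var[\hat\beta_j\mid X]],\]where the second term vanishes because $\bbE[\hat\beta_j\mid X] = \frac 1N X_j^TX\,\bbE[\beta] = 0$. Note that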

\[\Var[\hat\beta_j\mid X] = \frac{1}{N^2}\Var[X_j^T\phi\mid X] = \frac{1}{N^2}X_j^T\Var[\phi\mid X]X_j = \frac{1}{N^2}\left(\frac{h_g^2}{M}X_j^TXX^TX_j+N(1-h_g^2)\right)\]using $\Var[\phi\mid X] = \frac{h_g^2}{M}XX^T + (1-h_g^2)I$ and $X_j^TX_j = N$. Write the first term as

\[\frac{1}{N^2}X_j^TXX^TX_j = \sum_{k=1}^M\tilde r_{jk}^2\]where $\tilde r_{jk} = \frac 1N\sum_{i=1}^NX_{ij}X_{ik}$ denotes the sample correlation between additively-coded genotypes at variants $j$ and $k$.

Since

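\[\bbE[\tilde r_{jk}^2] \approx r_{jk}^2 + (1-r_{jk}^2)/N\]*(heuristically, $\bbE[\tilde r_{jk}^2] = (\bbE[\tilde r_{jk}])^2 + \Var[\tilde r_{jk}]$, where $\bbE[\tilde r_{jk}]\approx r_{jk}$ and the sampling variance of $\tilde r_{jk}$ is of order $1/N$, so the $1/N$ correction reflects sampling noise in the estimated correlation)*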

then

\[\bbE\left[\sum_{k=1}^M\tilde r_{jk}^2\right]\approx \ell_j + \frac{M-\ell_j}{N}\]Thus,

\[\bbE[\chi_j^2] \approx N\left(\frac{h_g^2}{M}\left[\ell_j + \frac{M-\ell_j}{N}\right] + \frac 1N(1-h_g^2)\right) = h_g^2\left(\frac{N\ell_j+M-\ell_j}{M} - 1\right) + 1 = \frac{N(1-1/N)h_g^2}{M}\ell_j +1 \approx \frac{Nh_g^2}{M}\ell_j +1\]
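The result can be checked numerically with a small simulation. The sketch below is only a sanity check under assumed settings: Gaussian variables stand in for genotypes, LD is imposed through equal-sized blocks with within-block correlation `rho`, and all parameter values are made up for the example. It draws many $(\beta, \epsilon)$ pairs for one fixed $X$ and compares the average $\chi^2$ with $Nh_g^2\ell_j/M + 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, h2 = 200, 100, 0.5
n_blocks, block_size, rho = 10, 10, 0.5   # n_blocks * block_size == M
reps = 4000

# Gaussian stand-ins for genotypes: SNPs in the same block share a latent factor,
# giving population correlation r_jk = rho within a block and 0 across blocks.
shared = np.repeat(rng.normal(size=(N, n_blocks)), block_size, axis=1)
raw = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * rng.normal(size=(N, M))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Population LD Score (identical for every SNP here) and its sample analogue.
ld_true = 1.0 + (block_size - 1) * rho ** 2
ld_sample = ((X.T @ X / N) ** 2).sum(axis=1)      # sum_k r~_jk^2

# Many phenotype replicates drawn from phi = X beta + epsilon.
beta = rng.normal(0.0, np.sqrt(h2 / M), size=(M, reps))
eps = rng.normal(0.0, np.sqrt(1.0 - h2), size=(N, reps))
beta_hat = X.T @ (X @ beta + eps) / N
chi2_mean = (N * beta_hat ** 2).mean()            # average over SNPs and replicates

predicted = N * h2 * ld_true / M + 1.0
print("mean sample LD Score:", ld_sample.mean())
print("l + (M - l)/N       :", ld_true + (M - ld_true) / N)
print("simulated mean chi2 :", chi2_mean)
print("predicted mean chi2 :", predicted)
```

The sample LD Score exceeds the population value by roughly $(M-\ell_j)/N$, which is the upward bias from sampling noise in $\tilde r_{jk}$. That excess is offset by the $-h_g^2$ term in the derivation, so the simulated mean should sit close to the final approximation.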