Generalized Higher Criticism in GWAS
Generalized Higher Criticism for Testing SNP-Set Effects in Genetic Association Studies
- individual SNP effects are generally weak, and the disease/trait associated SNPs identified in GWAS are insufficient in explaining much of the heritability of complex diseases and traits, even for highly heritable traits, such as height
- these findings suggest that single SNP analysis may be underpowered. This is particularly the case for low-frequency SNPs in sequencing association studies
- region-based analyses: combine information from multiple SNPs in a genetic construct
- genes, gene networks, and pathways are examples of genetic constructs that are likely to have multiple SNPs that function simultaneously to affect diseases and traits, for example, due to functional similarity or interaction
- the signal SNPs in a genetic construct are likely to be sparse and have weak signals. Hence, a methodology that does not require strong marginal SNP effects but is capable of aggregating these weak and sparse SNP effects together into a detectable signal at the genetic construct level, such as a gene, is needed to help increase the chance of detecting the effects of these genetic constructs and find the causes of the missing heritability
CGEM GWAS breast cancer study
- several SNPs in the FGFR2 region showed strong evidence of association with breast cancer risk using individual SNP analysis
- but none of these SNPs reached genome-wide significance when analyzing the CGEM GWAS data using the traditional individual SNP analysis
- sparse signals in a SNP set present a particularly difficult problem for detection
several methods for SNP-set testing:
- MinP
- variance-component tests
- Sequence Kernel Association Test (SKAT)
HC is a global test that combines information over all the marginal test statistics of a set of variables
2. GLM and Marginal SNP Score Test Statistics
- $N$ individuals genotyped over a region with $p$ observed SNPs in a SNP-set
- possible SNP-sets include genes, gene networks, or genetic pathways
- phenotypes $Y = [Y_1,\ldots, Y_N]^\top$
- the $N\times p$ genotype matrix $G$ is constructed such that its $i$-th row is $G_i^\top$, where $G_i = [G_{i1},\ldots, G_{ip}]^\top$ holds the genotypes of individual $i$
- $N\times q$ covariate matrix $X$
conditional on $(X_i, G_i)$, $Y_i$ follows a distribution in the exponential family
\[f(Y_i) = \exp((Y_i\theta_i - b(\theta_i)) / a_i(\phi) + c(Y_i, \phi))\]to construct a marginal test between the $j$-th SNP and $Y$, model
\[\mu_i = E(Y_i\mid G_i, X_i) = b'(\theta_i)\](why the derivative?) {:.comment}
using the GLM
\[g(\mu_i) = X_i^\top\alpha + G_i^\top\beta\]the variance of $Y_i$ is $\Var(Y_i) = a_i(\phi)\nu(\mu_i)$, where $\nu(\mu_i) = b''(\theta_i)$ is the variance function.
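As a worked illustration (not part of the original notes), the mean is a derivative of $b$ because the score of an exponential-family density has mean zero. Differentiating $\log f(Y_i)$ with respect to $\theta_i$ and taking expectations gives

\[E\left[\frac{\partial \log f(Y_i)}{\partial\theta_i}\right] = \frac{E(Y_i) - b'(\theta_i)}{a_i(\phi)} = 0 \quad\Longrightarrow\quad \mu_i = b'(\theta_i)\]

For example, assuming a binary phenotype modelled by logistic regression, we have $\theta_i = \log\{\mu_i/(1-\mu_i)\}$, $b(\theta_i) = \log(1 + e^{\theta_i})$ and $a_i(\phi) = 1$, so that

\[b'(\theta_i) = \frac{e^{\theta_i}}{1 + e^{\theta_i}} = \mu_i, \qquad b''(\theta_i) = \frac{e^{\theta_i}}{(1 + e^{\theta_i})^2} = \mu_i(1-\mu_i)\]

which recovers the familiar Bernoulli mean and the variance function $\nu(\mu_i) = \mu_i(1-\mu_i)$.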
testing for the overall effect of the SNP set $G_i$, which corresponds to the global null $H_0: \beta = 0$
Let $W$ and $P$ denote the weight matrix and the projection matrix under the null model, and let $G_j$ here denote the $j$-th column of $G$; the marginal score test statistic for $\beta_j$ under the global null is
\[Z_j = \frac{G_j^\top (Y-\hat\mu_0)}{\sqrt{G_j^\top P G_j}}\]where $\hat\mu_0 = g^{-1}(X\hat\alpha)$ and $\hat\alpha$ is the MLE of $\alpha$ under the null model $g(\mu_i) = X_i^\top\alpha$
these individual SNP test statistics are asymptotically jointly distributed as $Z\sim MVN(0, \Sigma)$, where $\Sigma_{jk} = \cov(Z_j, Z_k)$ is estimated from the data to give $\hat\Sigma$
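A minimal sketch of these marginal score statistics, assuming a binary phenotype with logistic regression (canonical logit link, $a_i(\phi) = 1$). The notes do not spell out $W$ and $P$; the sketch assumes the standard logistic-regression forms $W = \mathrm{diag}\{\hat\mu_{0i}(1-\hat\mu_{0i})\}$ and $P = W - WX(X^\top WX)^{-1}X^\top W$, and estimates $\cov(Z_j, Z_k)$ by $G_j^\top P G_k / \sqrt{G_j^\top P G_j \, G_k^\top P G_k}$. The function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=25, tol=1e-10):
    """Fit the null model logit(mu) = X @ alpha by Newton-Raphson (IRLS)."""
    alpha = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ alpha))
        w = mu * (1.0 - mu)
        # Newton step: (X' W X)^{-1} X' (y - mu)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
        alpha += step
        if np.max(np.abs(step)) < tol:
            break
    # Fitted null means mu_0 = g^{-1}(X alpha_hat)
    return 1.0 / (1.0 + np.exp(-X @ alpha))

def marginal_score_stats(G, X, y):
    """Return (Z, Sigma_hat) for a logistic null model (assumed forms of W and P)."""
    mu0 = fit_null_logistic(X, y)
    w = mu0 * (1.0 - mu0)            # diagonal of W
    WX = w[:, None] * X
    # P G = W G - W X (X' W X)^{-1} X' W G, without forming the N x N matrix P
    PG = w[:, None] * G - WX @ np.linalg.solve(X.T @ WX, WX.T @ G)
    V = G.T @ PG                     # p x p matrix with entries G_j' P G_k
    d = np.sqrt(np.diag(V))
    Z = (G.T @ (y - mu0)) / d        # marginal score statistics Z_j
    Sigma_hat = V / np.outer(d, d)   # estimated correlation matrix of Z
    return Z, Sigma_hat

# Simulated data under the global null (purely illustrative)
rng = np.random.default_rng(0)
N, p = 500, 5
X = np.column_stack([np.ones(N), rng.normal(size=N)])   # intercept + one covariate
G = rng.binomial(2, 0.3, size=(N, p)).astype(float)     # additive genotype coding
y = rng.binomial(1, 0.5, size=N).astype(float)
Z, Sigma_hat = marginal_score_stats(G, X, y)
```

Because each $Z_j$ is scaled by its own standard error, `Sigma_hat` has a unit diagonal, and its off-diagonal entries reflect the correlation (linkage disequilibrium) between SNPs after adjusting for the covariates.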
because the $Z_j$ are correlated, we define the decorrelated test statistics $Z^\star$ to be
\[Z^\star = U^{-1}Z\sim MVN(0, I_p)\]where $UU^\top = \hat \Sigma$ is the Cholesky decomposition.
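The decorrelation step can be sketched in a few lines of NumPy. The correlation matrix and statistics below are made-up numbers for illustration, and `decorrelate` is a hypothetical helper name; `np.linalg.cholesky` returns the lower-triangular factor $U$ with $UU^\top = \hat\Sigma$.

```python
import numpy as np

def decorrelate(Z, Sigma_hat):
    """Return Z* = U^{-1} Z, where U U^T = Sigma_hat is the Cholesky factorization."""
    U = np.linalg.cholesky(Sigma_hat)   # lower triangular
    return np.linalg.solve(U, Z)        # solves U Z* = Z without forming U^{-1}

# Illustrative (made-up) correlation matrix and marginal statistics
Sigma_hat = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])
Z = np.array([2.1, 1.4, -0.3])
Z_star = decorrelate(Z, Sigma_hat)

# Check: the covariance of Z* is U^{-1} Sigma U^{-T}, which should be the identity
U = np.linalg.cholesky(Sigma_hat)
U_inv = np.linalg.inv(U)
cov_star = U_inv @ Sigma_hat @ U_inv.T
```

Solving the triangular system rather than inverting $U$ is a numerical-stability choice; the explicit inverse appears here only to verify that the transformed covariance is $I_p$.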
3. The Higher Criticism Test
because the $Z_j$ are correlated, the statistic is based on the transformed $Z^\star$: for a threshold $t$, count the components whose magnitude is at least $t$
\[S^\star(t) = \sum_{j=1}^p 1_{\vert Z_j^\star\vert \ge t}\]
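A small sketch of the exceedance count $S^\star(t)$, together with its mean under the global null. Since the components of $Z^\star$ are asymptotically independent $N(0,1)$ under $H_0$, each indicator has success probability $P(\vert N(0,1)\vert \ge t) = 2\bar\Phi(t)$, so $E[S^\star(t)] = 2p\bar\Phi(t)$; comparing the observed count with this expectation is the idea behind higher criticism. The function names and the example values are hypothetical.

```python
import math
import numpy as np

def exceedance_count(z_star, t):
    """S*(t): number of components with |Z*_j| >= t."""
    return int(np.sum(np.abs(z_star) >= t))

def null_mean(p, t):
    """E[S*(t)] under the global null: p * P(|N(0,1)| >= t) = p * erfc(t / sqrt(2))."""
    return p * math.erfc(t / math.sqrt(2.0))

# Made-up decorrelated statistics for illustration
z_star = np.array([0.3, -2.5, 1.1, 3.2, -0.7, 2.0])
count = exceedance_count(z_star, 2.0)       # components -2.5, 3.2 and 2.0 qualify
expected = null_mean(len(z_star), 2.0)      # roughly 0.27 exceedances expected under H0
```

Here three of six statistics exceed the threshold while fewer than one would be expected under the null, which is the kind of excess the test is designed to pick up.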