Rare Variant Association Testing

Posted on Jul 18, 2019 (Update: Jan 15, 2020)

This note is based on

SKAT

• common variants identified by GWASs often explain only a small proportion of trait heritability
• rare genetic variants, alleles with a frequency less than 1%-5%
• standard methods used to test for association with single common genetic variants are underpowered for rare variants unless sample sizes or effect sizes are very large

A logical alternative approach: burden tests

• assess the cumulative effects of multiple variants in a genomic region.
• limitation: implicitly assume that all rare variants influence the phenotype in the same direction and with the same magnitude of effect, but most variants (common or rare) within a sequenced region to have little or no effect on phenotype, whereas some variants are protective and others deleterious, and the magnitude of each variant’s effect is likely to vary.

non-burden-based test: C-alpha test

• robust to the direction and magnitude of effect
• not allow for easy covariate adjustment
• also use permutation

SKAT: tests for association between variants in a region (both common and rare) and a dichotomous (e.g. case-control) or continuous phenotype while adjusting for covariates.

• a multiple regression model to directly regress the phenotype on genetic variants in a region and on covariates
• kernel association test to test the regression coefficients of the variants by using a variance-component score test in a mixed-model framework by accounting for rare variants
• allowance for epistatic variant effects.

epistasis（上位效应）

Assume $n$ subjects are sequenced in a region with $p$ variant sites observed.

For the $i$-th subject,

• $y_i$: the phenotype variable
• $X_i=(X_{i1},\ldots,X_{im})$: the covariates, might include age, gender, and top principal components of genetic variation for controlling population stratification.
• $G_i=(G_{i1},\ldots,G_{ip})$: the genotype for the $p$-variants within the region.

Consider the linear model

$y_i=\alpha_0 + \alpha'X_i + \beta'G_i + \varepsilon_i$

when the phenotypes are continuous traits, and the logistic model

$\logit P(y_i=1) = \alpha_0 + \alpha'X_i + \beta'G_i$

when the phenotypes are dichotomous.

Evaluating whether the gene variants influence the phenotype, adjusting for covariates, corresponding to testing the null hypothesis $H_0:\beta=0$.

The standard p-DF (?? TODO) likelihood ratio test has little power, especially for rare variants. SKAT tests $H_0$ by assuming each $\beta_j$ follows an arbitrary distribution with a mean of zero and a variance of $w_j\tau$, where $\tau$ is a variance component and $w_j$ is a pre-specified weight for variant $j$. One can easily see that $H_0:\beta=0$ is equivalent to testing $H_0:\tau = 0$, which can be tested with a variance-component score test in the corresponding mixed model; this is known to be a locally most powerful test.

Specifically, the variance-component score statistic is

$Q = (y - \hat \mu)'K (y - \hat\mu)\,,$

where $K=GWG’$, $\hat \mu$ is the predicted mean of $y$ under $H_0$.

• $G$: an $n\times p$ matrix with the $(i, j)$-th element being the genotype of variant $j$ of subject $i$
• $W = \diag(w_1,\ldots,w_p)$ contains the weights of the $p$ variants

And $K(G_i, G_{i’})=\sum_{j=1}^pw_jG_{ij}G_{i’j}$ measures the genetic similarity between subject $i$ and $i’$ in the region via the $p$ markers.

In practice, propose $\sqrt{w_j}=\mathrm{Beta}(\mathrm{MAF}_j;a_1,a_2)$

• $0 < a_1 \le 1$ and $a_2 \ge 1$: increasing the weight of rarer variants and decreasing the weight of common weights, a smaller $a_1$ results in more strongly increasing the weight of rarer variants
• $a_1 = a_2 = 1$, reduce to Uniform distribution, corresponds to $w_j = 1$
• $\sqrt{w_j} = 1/\sqrt{\mathrm{MAF}_j(1-\mathrm{MAF}_j)}$

The following figure (with the Julia code for plotting) shows such examples of weights, in which the first row is the pdf, while the second row is the normalized weights as the paper presents.

using Distributions
using Plots
using LaTeXStrings
p1 = plot(0:0.01:1, pdf.(Beta(1, 25), 0:0.01:1), label = L"\alpha = 1, \beta = 25", ylim = (0, 5), ylabel = "pdf")
plot!(p1, 0:0.01:1, pdf.(Beta(1, 1), 0:0.01:1), label = L"\alpha = 1, \beta = 1")
plot!(p1, 0:0.01:1, pdf.(Beta(0.5, 0.5), 0:0.01:1), label = L"\alpha = 0.5, \beta = 0.5")
p2 = plot(0:0.01:1, logpdf.(Beta(1, 25), 0:0.01:1), label = L"\alpha = 1, \beta = 25", ylabel = "logpdf")
p3 = plot(0:0.001:0.5, pdf.(Beta(1, 25), 0:0.001:0.5) ./ pdf(Beta(1, 25), 1e-3), label = L"\alpha = 1, \beta = 25", ylim = (0, 1),
ylabel = "Weight", xlabel = "MAF")
plot!(p3, 0:0.001:0.5, pdf.(Beta(1, 1), 0:0.001:0.5), label = L"\alpha = 1, \beta = 1")
plot!(p3, 0:0.001:0.5, pdf.(Beta(0.5, 0.5), 0:0.001:0.5) ./ pdf(Beta(0.5, 0.5), 1e-3), label = L"\alpha = 0.5, \beta = 0.5")
p4 = plot(0:0.001:0.05, pdf.(Beta(1, 25), 0:0.001:0.05) ./ pdf(Beta(1, 25), 1e-3), label = L"\alpha = 1, \beta = 25", ylim = (0, 1),
ylabel = "Weight", xlabel = "MAF")
plot!(p4, 0:0.001:0.05, pdf.(Beta(1, 1), 0:0.001:0.05), label = L"\alpha = 1, \beta = 1")
plot!(p4, 0:0.001:0.05, pdf.(Beta(0.5, 0.5), 0:0.001:0.05) ./ pdf(Beta(0.5, 0.5), 1e-3), label = L"\alpha = 0.5, \beta = 0.5")
plot(p1, p2, p3, p4)


ZFA

The challenge of analysing genome sequencing datasets, besides multiple-testing issues arising from high dimensionality, centres on extreme low allele frequencies–classical statistical tests lose power on genetic variants with small variance.

In whole genome or exome sequencing data, over 99% of the variants have minor allele frequency (MAF) below 1%.

MAF in “变异过滤”

• 数据质量
• 人群频率：不同位点的基因型在人群中的频率是不一样的，不同人群致病变异的携带率也不大一样。一般来说，人群中频率高的突变往往是没有致病性，可以用来过滤，排除常见的良性变异。基因变异程度可根据 次等位基因频率 (minor allele frequency, MAF) 划分，MAF 介于 5%-50% 之间称为常见变异，MAF 介于 1%-5% 之间的为罕见变异。在研究罕见病的致病变异时，应该过滤掉非罕见变异。
• 变异分类
• 疾病知识
• 危害性预测结果

A number of rare variant association tests has been proposed to improved power:

• pulling adjacent variants together and up-weighting the minor allele
• applying a linear mixed model on a certain genomic region (variance component tests)

fixed window

For most rare variant methods, assume a fixed genomic region for testing, such as a gene or a fixed window.

Drawbacks: - unnecessary noise in a fixed window - exome data have unknown gene functions - => need to optimize the collapsing region

Strength: most effective when the true signals display a certain degree of clustering - not unreasonable as linkage disequilibrium (?) exists among adjacent variants - variants in the same gene regulatory element would physically cluster in DNA sequences

sliding window

scan statistics: need permutations - not possible to select window size in an exhaustive manner for whole genome evaluation - no stand-alone window size optimization for other (?) rare variant association tests.

ZFA

optimize the testing region for any region-based rare variant association test as a wrapper function

• zooming: partition a fixed genomic region by an order of two, search across all partition levels to identify the region with maximum information, evaluated by the smallest association $P$-value of that partition
• focus: refine the region by adding or subtracting adjacent variants near the boundaries.

Materials and Methods

Zoom-Focus algorithm

• $P$: total number of variants in a fixed window
• $R$: maximum order of binary partitions for $P$, $R=\arg\max{r:2^{r-1}<P, r=1,2,\ldots}$
• $r$: the order of partitions, the corresponding number of regions is $n_r=2^{r-1}$
• $d$: the number of variants in a partitioned region, $d=P/n_r=P/2^{r-1}$
• $c$: the index for the $c$-th partition, $c=1,\ldots,n_r$.
• $\phi(d, c)={x_{cj},j=1,\ldots,d}$: the region that contains variants in a partition size $d$ and index $c$
• $f(\cdot)$ be the Bonferroni corrected $P$-value calculated by rare variant method $F(\cdot)$ measured on region $\phi(d,c)$: $f(d,c)=n_rF[\phi(d,c)]$.

Zooming:

$(\hat d, \hat c) = \arg\min\{f(d,c),r=1,\ldots,R;c=1\text{ to }n_r\}\,.$

Focusing:

Refine the boundaries of $(\hat d,\hat c)$ by extending both lower and upper bounds,

\begin{align} LB_f &=\arg\min\{f(LB',UB), LB'=LB+i;i=[-\hat d/2, \hat d/4]\}\\ UB_f &=\arg\min\{f(LB_f,UB'),UB'=UB+i;i=[-\hat d/4, \hat d/2]\} \end{align}

Then the $P$-value of ZFA on a given window of $P$ variants is:

$p\text{-value} = (k+P-1)F(LB_f,UB_f)\,,$

where $P-1$ is the number of tests in the zooming step, and $k$ is the number of tests in the Focusing step.

fast-ZOOM

Published in categories Note