# Genetic Relatedness in High-Dimensional Linear Models


This post is based on Guo, Z., Wang, W., Cai, T. T., & Li, H. (2019). Optimal Estimation of Genetic Relatedness in High-Dimensional Linear Models. Journal of the American Statistical Association, 114(525), 358–369.

Consider estimating the genetic relatedness between two traits based on genome-wide association data. Within the framework of high-dimensional linear models, the paper introduces two measures of genetic relatedness and develops optimal estimators for them:

- genetic covariance: the inner product of the two regression vectors
- genetic correlation: the inner product normalized by the lengths of the two regression vectors

The paper proposes functional de-biased estimators (FDEs), which consist of

- an initial step with the plug-in scaled lasso estimator
- a further bias correction step

It also develops estimators of the quadratic functionals of the regression vectors, which can be used to estimate the heritability of each trait.

The estimators are shown to be minimax rate-optimal.

## Introduction

### Motivation and Background

Genome-wide association studies (GWAS) have identified thousands of genetic variants, or single-nucleotide polymorphisms (SNPs), that are associated with various complex phenotypes. The results show that many complex phenotypes share common genetic variants, including various autoimmune diseases and psychiatric disorders.

The concept of genetic relatedness or genetic correlations has been proposed to describe the shared genetic associations within pairs of quantitative traits based on GWAS data. This is in contrast to the traditional approaches of estimating co-heritability based on twin or family studies, where measurements of both traits are required on the same set of individuals.

Due to the availability of GWAS datasets of many important traits, there has been significant recent interest in methods for quantifying and estimating the genetic relatedness between two traits based on large-scale genetic association data.

Several measures of genetic relatedness based on GWAS data have been proposed.

### Definition and Problem Formulation

A pair of traits $(y, w)$ is modeled as a linear combination of $p$ genetic variants plus an error term that includes environmental and unmeasured genetic effects,

$$
y = X\beta + \epsilon, \qquad w = Z\gamma + \delta,
$$

where the rows $X_{i\cdot}$ are i.i.d. $p$-dimensional sub-Gaussian random vectors with covariance matrix $\Sigma$, the rows $Z_{i\cdot}$ are i.i.d. $p$-dimensional sub-Gaussian random vectors with covariance matrix $\Gamma$, and the error $(\epsilon,\delta)^T$ follows the multivariate normal distribution with mean zero and covariance

$$
\begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}
$$

and is assumed to be independent of $X$ and $Z$.

Assume the traits $y$ and $w$ have mean zero, and that the $j$-th columns $X_{\cdot j}$ and $Z_{\cdot j}$ are the numerically coded genetic markers at the $j$-th genetic variant, with mean zero and variance 1.
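To make the setup concrete, here is a minimal simulation sketch of the two-trait model. Several things in it are my own simplifying assumptions rather than anything stated above: the function name `simulate_traits`, the sample sizes `n1` and `n2`, Gaussian markers with identity covariance ($\Sigma = \Gamma = I_p$), and independent errors for the two traits.

```python
import numpy as np

def simulate_traits(n1, n2, p, beta, gamma, sigma1=1.0, sigma2=1.0, seed=0):
    """Draw (y, X) and (w, Z) from y = X beta + eps and w = Z gamma + delta.

    Assumptions (not from the paper): standard normal markers, so each
    column has population mean 0 and variance 1, and independent errors.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n1, p))          # design for trait y
    Z = rng.standard_normal((n2, p))          # design for trait w
    eps = sigma1 * rng.standard_normal(n1)    # error of trait y
    delta = sigma2 * rng.standard_normal(n2)  # error of trait w
    y = X @ beta + eps
    w = Z @ gamma + delta
    return y, X, w, Z

# Hypothetical example: 20 variants, partially overlapping effects.
beta = np.zeros(20)
beta[:3] = 0.5
gamma = np.zeros(20)
gamma[1:4] = 0.5
y, X, w, Z = simulate_traits(n1=100, n2=80, p=20, beta=beta, gamma=gamma)
```

The two designs are drawn separately, reflecting that the two traits need not be measured on the same individuals.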

Under this model, if the columns of $X$ and $Z$ are independent, then for the $i$-th observation,

$$
\mathrm{Var}(y_i) = \Vert \beta\Vert_2^2 + \sigma_1^2, \qquad \mathrm{Var}(w_i) = \Vert \gamma\Vert_2^2 + \sigma_2^2,
$$

so $\Vert \beta\Vert_2^2/(\Vert \beta\Vert_2^2+\sigma_1^2)$ and $\Vert \gamma\Vert_2^2/(\Vert \gamma\Vert_2^2+\sigma^2_2)$ can be interpreted as the **narrow sense heritability** of each trait.
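As a sanity check on this interpretation, the sketch below computes the heritability ratio from known parameters and compares it with the empirical share of variance explained by the genetic component. The function name `narrow_sense_heritability` and all numbers are illustrative assumptions. The check relies on independent, unit-variance columns, under which $X_{i\cdot}^T\beta$ has variance $\Vert\beta\Vert_2^2$.

```python
import numpy as np

def narrow_sense_heritability(beta, sigma2):
    """Return ||beta||_2^2 / (||beta||_2^2 + sigma2) for error variance sigma2."""
    signal = float(np.dot(beta, beta))
    return signal / (signal + sigma2)

# Hypothetical parameters: 50 variants with equal small effects.
rng = np.random.default_rng(1)
n, p = 20_000, 50
beta = np.full(p, 0.1)   # ||beta||_2^2 = 0.5
sigma2 = 0.5             # error variance

h2 = narrow_sense_heritability(beta, sigma2)

# Monte Carlo check with independent standard normal columns.
X = rng.standard_normal((n, p))
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
empirical = float(np.var(X @ beta) / np.var(y))
```

With these parameters `h2` is one half, and `empirical` should land close to it up to sampling noise.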

Then one measure of genetic relatedness is the inner product of the regression coefficients *(not clear yet)*,

$$
I(\beta,\gamma) = \langle \beta, \gamma\rangle = \sum_{j=1}^p \beta_j\gamma_j,
$$

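and a normalized inner product called genetic correlation,

$$
R(\beta,\gamma) = \frac{\langle \beta, \gamma\rangle}{\Vert \beta\Vert_2 \Vert \gamma\Vert_2}\, \mathbf{1}\left(\Vert \beta\Vert_2 \Vert \gamma\Vert_2 > 0\right).
$$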
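For known coefficient vectors, both measures are one-liners. The sketch below is only meant to pin down the definitions. The function names are hypothetical, and returning zero when either vector is zero is a convention I am assuming here.

```python
import numpy as np

def genetic_covariance(beta, gamma):
    """Inner product <beta, gamma> of the two regression vectors."""
    return float(np.dot(beta, gamma))

def genetic_correlation(beta, gamma):
    """Inner product normalized by the lengths of the two vectors."""
    norm_prod = float(np.linalg.norm(beta) * np.linalg.norm(gamma))
    if norm_prod == 0.0:
        return 0.0  # assumed convention when a trait has no genetic signal
    return genetic_covariance(beta, gamma) / norm_prod

# Hypothetical effect vectors over three variants.
beta = np.array([1.0, 2.0, 0.0])
same_direction = 2.0 * beta              # proportional effects
disjoint = np.array([0.0, 0.0, 3.0])     # no shared variants

cov_same = genetic_covariance(beta, same_direction)
corr_same = genetic_correlation(beta, same_direction)
corr_disjoint = genetic_correlation(beta, disjoint)
```

The correlation is scale-free: doubling every effect of one trait changes the genetic covariance but not the genetic correlation.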

### Methods and Main Results

A naive estimator: estimate $\beta$ and $\gamma$ first, then plug in.

- lasso and scaled lasso: however, the accumulation of weak effects may contribute significantly to the trait variability
- marginal regression with screening: also suffers from weak effects

A two-step procedure: functional de-biased estimators (FDEs)

1. estimate by the plug-in scaled lasso estimator
2. correct the plug-in scaled lasso estimator
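The weak-effects problem for plug-in estimators can be seen in a stylized, noise-free toy example. Everything here is an assumption for illustration: the effect sizes, the threshold `tau`, and the use of hard thresholding as a crude stand-in for the selection behaviour of sparse estimators. It is not the lasso or the scaled lasso.

```python
import numpy as np

def hard_threshold(v, tau):
    """Keep entries with |v_j| > tau and set the rest to zero."""
    return np.where(np.abs(v) > tau, v, 0.0)

p = 1000
beta = np.zeros(p)
gamma = np.zeros(p)
beta[:5] = 1.0         # strong effects, trait 1 only
gamma[5:10] = 1.0      # strong effects, trait 2 only
beta[10:210] = 0.05    # 200 weak effects shared by both traits
gamma[10:210] = 0.05

# True genetic covariance comes entirely from the weak shared effects.
true_cov = float(beta @ gamma)   # 200 * 0.05**2

# Plug-in after thresholding: the weak effects are discarded.
tau = 0.1
plug_in = float(hard_threshold(beta, tau) @ hard_threshold(gamma, tau))
```

Here the plug-in value is exactly zero, even though the true inner product is not. The strong effects sit on disjoint variants, so all of the shared signal lives in effects that are too small to survive selection.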

Comparison:

- the plug-in estimator based on the scaled lasso estimators: suffers from a large bias
- the plug-in estimator based on the de-biased lasso estimators: suffers from inflated variance
- the FDE estimator: balances the bias and variance in the optimal way
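One way to read this comparison is through an elementary identity that holds for any initial estimators $\hat\beta$ and $\hat\gamma$. This is plain algebra and my own gloss, not a statement of the paper's exact construction:

$$
\begin{aligned}
\langle \beta, \gamma\rangle
&= \langle \hat\beta + (\beta - \hat\beta),\ \hat\gamma + (\gamma - \hat\gamma)\rangle \\
&= \langle \hat\beta, \hat\gamma\rangle
 + \langle \hat\gamma, \beta - \hat\beta\rangle
 + \langle \hat\beta, \gamma - \hat\gamma\rangle
 + \langle \beta - \hat\beta, \gamma - \hat\gamma\rangle .
\end{aligned}
$$

The plug-in estimator keeps only the first term. The two middle terms are first order in the estimation errors, while the last term is a product of two errors. A correction step that estimates only the two middle terms, rather than de-biasing every coordinate of $\hat\beta$ and $\hat\gamma$, is consistent with the bias and variance trade-off described in the list above.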

*(other impression: good writing and typesetting)*