WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Tutorial on Polygenic Risk Score

Posted on
Tags: Polygenic Risk Score, Genome-wide Association Studies

This note is based on Choi, S. W., Mak, T. S.-H., & O’Reilly, P. F. (2020). Tutorial: A guide to performing polygenic risk score analyses. Nature Protocols, 15(9), Article 9.

Polygenic score (PGS) or polygenic risk score (PRS): an estimate of an individual’s genetic liability to a trait or disease, calculated according to

  • their genotype profile and
  • relevant genome-wide association study (GWAS) data

Applications:

  • assess shared etiology between phenotypes
  • evaluate the clinical utility of genetic data for complex disease
  • as part of experimental studies in which, e.g.,
    • experiments are performed that compare outcomes (e.g., gene expression and cellular response to treatment) between individual with low and high PRS values

As GWAS sample sizes increase and PRSs become more powerful, PRSs are set to play a key role in research and stratified medicine.

Despite the importance and growing application of PRSs, there are limited guidelines for performing PRS analyses, which can lead to inconsistency between studies and misinterpretation of results.

The paper provide detailed guidelines for performing and interpreting PRS analyses

  • outline standard quality control steps
  • discuss different methods for the calculation of PRSs
  • provide an introductory online tutorial
  • highlight common misconceptions relating to PRS results
  • offer recommendations for best practice
  • discuss future challenges

Introduction

GWAS have identified a large number of genetic variants, mostly SNPs, significantly associated with a wide range of complex traits. But these variants typically have a small effect and correspond to a small fraction of truly associated variants, meaning that they have limited predictive power.

  • a linear mixed model: much of the heritability of height can be explained by evaluating the effects of all SNPs simultaneously
  • LD score regression and PRS method have also aggregated the effects of variants across the genome to estimate heritability, to infer genetic overlap between traits and to predict phenotypes based on genetic profile

LD score regression and PRS can all to infer heritability and shared etiology among complex traits, PRS is the only approach that provides an estimate of genetic liability to a trait at the individual level.

Introduction to PRSs

define PRSs as a single value estimate of an individual’s genetic liability to a phenotype, calculated as a sum of their genome-wide genotypes, weighted by corresponding genotype effect size estimates derived from GWAS summary statistic data.

PRS analyses can be characterized by two key input data sets that they require

  • base data (GWAS), consisting of summary statistics (e.g., betas and P values) of genotype-phenotype associations at genetic variants (hereafter SNPs) genome-wide
  • target data: consisting of genotypes, and usually also phenotypes, in individuals from a sample to which the researchers performing the PRS analysis have access (often not publicly available)
    • typically formatted as PLINK binary files

https://choishingwan.github.io/PRS-Tutorial/

Step 1: QC of Base Data

$ gunzip -c Height.gwas.txt.gz | head
CHR     BP      SNP     A1      A2      N       SE      P       OR      INFO    MAF
1       756604  rs3131962       A       G       388028  0.00301666      0.483171        0.997886915712657     0.890557941364774       0.369389592764921
1       768448  rs12562034      A       G       388028  0.00329472      0.834808        1.00068731609353      0.895893511351165       0.336845754096289
1       779322  rs4040617       G       A       388028  0.00303344      0.42897 0.997603556067569    0.897508290615237        0.377368010940814
1       801536  rs79373928      G       T       388028  0.00841324      0.808999        1.00203569922793      0.908962856432993       0.483212245374095
1       808631  rs11240779      G       A       388028  0.00242821      0.590265        1.00130832511154      0.893212523690488       0.450409558999587
1       809876  rs57181708      G       A       388028  0.00336785      0.71475 1.00123165786833     0.923557624081969        0.499743932656759
1       835499  rs4422948       G       A       388028  0.0023758       0.710884        0.999119752645202     0.906437735120596       0.481016005816168
1       838555  rs4970383       A       C       388028  0.00235773      0.150993        0.996619945289758     0.907716506801574       0.327164029672754
1       840753  rs4970382       C       T       388028  0.00207377      0.199967        0.99734567895614      0.914602590137255       0.498936220426316

check the file integrity

$ md5sum Height.gwas.txt.gz 
a2b15fb6a2bbbe7ef49f67959b43b160  Height.gwas.txt.gz

remove SNPs with MAF < 1% AND INFO < 0.8

gunzip -c Height.gwas.txt.gz | awk 'NR==1 || ($11 > 0.01) && ($10 > 0.8) {print}' | gzip > Height.gz

remove duplicated SNPs

gunzip -c Height.gz | awk '{seen[$3]++; if (seen[$3]==1){print}}' | gzip > Height.nodup.gz

there are 2 duplicated SNPs,

$ gunzip -c Height.gz | awk '{seen[$3]++; if (seen[$3]>1){print}}'
3       182254751       rs7622072       C       G       388028  0.00526681      0.126843        0.991991549879677     0.903220366820217       0.472790254192978
5       139208711       rs113309830     C       G       388028  0.00377597      0.542249        0.997701505592967     0.90708730603453        0.436439940493625

remove ambiguous SNPs,

$ gunzip -c Height.nodup.gz | awk '!( ($4=="A" && $5=="T") || ($4=="T" && $5=="A") || ($4=="G") && ($5=="C") || ($4=="C" && $5=="G")) {print}' | gzip > Height.QC.gz
$ gunzip -c Height.QC.gz | wc -l
499618

Step 2: QC of Target Data

$ unzip EUR.zip
$ ~/Programs/plink1.9/plink --bfile EUR --maf 0.01 --hwe 1e-6 --geno 0.01 --mind 0.01 --write-snplist --make-just-fam --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --geno 0.01
  --hwe 1e-6
  --maf 0.01
  --make-just-fam
  --mind 0.01
  --out EUR.QC
  --write-snplist

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
14 people removed due to missing genotype data (--mind).
IDs written to EUR.QC.irem .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 489 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 1806 het. haploid genotypes present (see EUR.QC.hh ); many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.999816.
5353 variants removed due to missing genotype data (--geno).
Warning: --hwe observation counts vary by more than 10%, due to the X
chromosome.  You may want to use a less stringent --hwe p-value threshold for X
chromosome variants.
--hwe: 944 variants removed due to Hardy-Weinberg exact test.
5061 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
540534 variants and 489 people pass filters and QC.
Note: No phenotypes present.
List of variant IDs written to EUR.QC.snplist .
--make-just-fam to EUR.QC.fam ... done.

perform pruning to remove highly correlated SNPs

$ ~/Programs/plink1.9/plink --bfile EUR --keep EUR.QC.fam --extract EUR.QC.snplist --indep-pairwise 200 50 0.25 --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --extract EUR.QC.snplist
  --indep-pairwise 200 50 0.25
  --keep EUR.QC.fam
  --out EUR.QC

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--extract: 540534 variants remaining.
--keep: 489 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 489 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 1273 het. haploid genotypes present (see EUR.QC.hh ); many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.999919.
540534 variants and 489 people pass filters and QC.
Note: No phenotypes present.
Pruned 20214 variants from chromosome 1, leaving 21127.
Pruned 20984 variants from chromosome 2, leaving 20879.
Pruned 17463 variants from chromosome 3, leaving 17812.
Pruned 16344 variants from chromosome 4, leaving 16690.
Pruned 15027 variants from chromosome 5, leaving 15981.
Pruned 20731 variants from chromosome 6, leaving 15475.
Pruned 14046 variants from chromosome 7, leaving 14606.
Pruned 13289 variants from chromosome 8, leaving 13667.
Pruned 10723 variants from chromosome 9, leaving 11951.
Pruned 12722 variants from chromosome 10, leaving 13383.
Pruned 13053 variants from chromosome 11, leaving 12666.
Pruned 11654 variants from chromosome 12, leaving 12500.
Pruned 8429 variants from chromosome 13, leaving 9609.
Pruned 8072 variants from chromosome 14, leaving 8791.
Pruned 7978 variants from chromosome 15, leaving 8470.
Pruned 8808 variants from chromosome 16, leaving 9500.
Pruned 8032 variants from chromosome 17, leaving 8724.
Pruned 7204 variants from chromosome 18, leaving 8560.
Pruned 7014 variants from chromosome 19, leaving 6726.
Pruned 6341 variants from chromosome 20, leaving 7508.
Pruned 3629 variants from chromosome 21, leaving 4280.
Pruned 4085 variants from chromosome 22, leaving 4323.
Pruned 16235 variants from chromosome 23, leaving 5229.
Pruning complete.  272077 of 540534 variants removed.
Marker lists written to EUR.QC.prune.in and EUR.QC.prune.out .

calculate Heterozygosity rates

$ ~/Programs/plink1.9/plink --bfile EUR --keep EUR.QC.fam --extract EUR.QC.prune.in --het --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --extract EUR.QC.prune.in
  --het
  --keep EUR.QC.fam
  --out EUR.QC

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--extract: 268457 variants remaining.
--keep: 489 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 489 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 558 het. haploid genotypes present (see EUR.QC.hh ); many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.999958.
268457 variants and 489 people pass filters and QC.
Note: No phenotypes present.
--het: 263228 variants scanned, report written to EUR.QC.het .

remove individuals with F coefficients that are more than 3 standard deviation (SD) units from the mean

dat <- read.table("EUR.QC.het", header=T) # Read in the EUR.het file, specify it has header
m <- mean(dat$F) # Calculate the mean  
s <- sd(dat$F) # Calculate the SD
valid <- subset(dat, F <= m+3*s & F >= m-3*s) # Get any samples with F coefficient within 3 SD of the population mean
write.table(valid[,c(1,2)], "EUR.valid.sample", quote=F, row.names=F) # print FID and IID for valid samples
q() # exit R

perform sex check in PLINK, where individuals are called as females if their X chromesome homozygosity estimate is < 0.2 and as males if the estimate is > 0.8

$ ~/Programs/plink1.9/plink --bfile EUR --keep EUR.valid.sample --extract EUR.QC.prune.in --check-sex --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --check-sex
  --extract EUR.QC.prune.in
  --keep EUR.valid.sample
  --out EUR.QC

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--extract: 268457 variants remaining.
--keep: 487 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 487 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 558 het. haploid genotypes present (see EUR.QC.hh ); many commands
treat these as missing.
Total genotyping rate in remaining samples is 0.999958.
268457 variants and 487 people pass filters and QC.
Note: No phenotypes present.
--check-sex: 5229 Xchr and 0 Ychr variant(s) scanned, 4 problems detected.
Report written to EUR.QC.sexcheck .
# Read in file
valid <- read.table("EUR.valid.sample", header=T)
dat <- read.table("EUR.QC.sexcheck", header=T)
valid <- subset(dat, STATUS=="OK" & FID %in% valid$FID)
write.table(valid[,c("FID", "IID")], "EUR.QC.valid", row.names=F, col.names=F, sep="\t", quote=F) 
q() # exit R

remove related individuals

$ ~/Programs/plink1.9/plink --bfile EUR --keep EUR.QC.valid --extract EUR.QC.prune.in --rel-cutoff 0.125 --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --extract EUR.QC.prune.in
  --keep EUR.QC.valid
  --out EUR.QC
  --rel-cutoff 0.125

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--extract: 268457 variants remaining.
--keep: 483 people remaining.
Using up to 4 threads (change this with --threads).
Before main variant filters, 483 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.999957.
268457 variants and 483 people pass filters and QC (before --rel-cutoff).
Note: No phenotypes present.
Excluding 5229 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
0 people excluded by --rel-cutoff.
Remaining sample IDs written to EUR.QC.rel.id .

generate final QC’ed target data file (skip the mismatching steps since it is said most PRS software would perform strand-flipping automatically)

$ ~/Programs/plink1.9/plink --bfile EUR --make-bed --keep EUR.QC.rel.id --extract EUR.QC.snplist --out EUR.QC
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.QC.log.
Options in effect:
  --bfile EUR
  --extract EUR.QC.snplist
  --keep EUR.QC.rel.id
  --make-bed
  --out EUR.QC

19690 MB RAM detected; reserving 9845 MB for main workspace.
551892 variants loaded from .bim file.
503 people (240 males, 263 females) loaded from .fam.
--extract: 540534 variants remaining.
--keep: 483 people remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 483 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.999918.
540534 variants and 483 people pass filters and QC.
Note: No phenotypes present.
--make-bed to EUR.QC.bed + EUR.QC.bim + EUR.QC.fam ... done.

Step 3: Calculating and Analysing PRS

available programs:

  • PLINK
  • PRSice-2
  • LDPred-2
  • lassosum

use PLINK

update effect size by performing a logarithm transformation

dat <- read.table(gzfile("Height.QC.gz"), header=T)
dat$BETA <- log(dat$OR)
write.table(dat, "Height.QC.Transformed", quote=F, row.names=F)
q() # exit R

LD corresponds to the correlation between the genotypes of genetic variants across the genome, makes identifying the contribution from causal independent genetic variants extremely challenging.

One way of approximately capturing the right level of casual signal is to perform clumping, which removes SNPs in ways that only weakly correlated SNPs are retained but preferentially retaining the SNPs most associated with the phenotype under study.

$ ~/Programs/plink1.9/plink --bfile EUR.QC --clump-p1 1 --clump-r2 0.1 --clump-kb 250 --clump Height.QC.Transformed --clump-snp-field SNP --clump-field P --out EUR
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.log.
Options in effect:
  --bfile EUR.QC
  --clump Height.QC.Transformed
  --clump-field P
  --clump-kb 250
  --clump-p1 1
  --clump-r2 0.1
  --clump-snp-field SNP
  --out EUR

19690 MB RAM detected; reserving 9845 MB for main workspace.
540534 variants loaded from .bim file.
483 people (232 males, 251 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 483 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999918.
540534 variants and 483 people pass filters and QC.
Note: No phenotypes present.
Warning: 'rs1076829' is missing from the main dataset, and is a top variant.
Warning: 'rs3129818' is missing from the main dataset, and is a top variant.
Warning: 'rs3118359' is missing from the main dataset, and is a top variant.
9798 more top variant IDs missing; see log file.
--clump: 193758 clumps formed from 489816 top variants.
Results written to EUR.clumped .

extract SNP ID

awk 'NR!=1 {print $3}' EUR.clumped > EUR.valid.snp

generate PRS,

awk '{print $3, $8}' Height.QC.Transformed > SNP.pvalue
echo "0.001 0 0.001" > range_list 
echo "0.05 0 0.05" >> range_list
echo "0.1 0 0.1" >> range_list
echo "0.2 0 0.2" >> range_list
echo "0.3 0 0.3" >> range_list
echo "0.4 0 0.4" >> range_list
echo "0.5 0 0.5" >> range_list

image

incorporate principal components as covariates

$ ~/Programs/plink1.9/plink --bfile EUR.QC --indep-pairwise 200 50 0.25 --out EUR
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.log.
Options in effect:
  --bfile EUR.QC
  --indep-pairwise 200 50 0.25
  --out EUR

19690 MB RAM detected; reserving 9845 MB for main workspace.
540534 variants loaded from .bim file.
483 people (232 males, 251 females) loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 483 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999918.
540534 variants and 483 people pass filters and QC.
Note: No phenotypes present.
Pruned 20224 variants from chromosome 1, leaving 21117.
Pruned 20957 variants from chromosome 2, leaving 20906.
Pruned 17466 variants from chromosome 3, leaving 17809.
Pruned 16341 variants from chromosome 4, leaving 16693.
Pruned 15054 variants from chromosome 5, leaving 15954.
Pruned 20739 variants from chromosome 6, leaving 15467.
Pruned 14034 variants from chromosome 7, leaving 14618.
Pruned 13299 variants from chromosome 8, leaving 13657.
Pruned 10739 variants from chromosome 9, leaving 11935.
Pruned 12750 variants from chromosome 10, leaving 13355.
Pruned 13044 variants from chromosome 11, leaving 12675.
Pruned 11666 variants from chromosome 12, leaving 12488.
Pruned 8436 variants from chromosome 13, leaving 9602.
Pruned 8080 variants from chromosome 14, leaving 8783.
Pruned 7960 variants from chromosome 15, leaving 8488.
Pruned 8829 variants from chromosome 16, leaving 9479.
Pruned 8038 variants from chromosome 17, leaving 8718.
Pruned 7203 variants from chromosome 18, leaving 8561.
Pruned 7014 variants from chromosome 19, leaving 6726.
Pruned 6352 variants from chromosome 20, leaving 7497.
Pruned 3638 variants from chromosome 21, leaving 4271.
Pruned 4083 variants from chromosome 22, leaving 4325.
Pruned 16243 variants from chromosome 23, leaving 5221.
Pruning complete.  272189 of 540534 variants removed.
Marker lists written to EUR.prune.in and EUR.prune.out .


$ ~/Programs/plink1.9/plink --bfile EUR.QC --extract EUR.prune.in --pca 6 --out EUR
PLINK v1.90b7 64-bit (16 Jan 2023)             www.cog-genomics.org/plink/1.9/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to EUR.log.
Options in effect:
  --bfile EUR.QC
  --extract EUR.prune.in
  --out EUR
  --pca 6

19690 MB RAM detected; reserving 9845 MB for main workspace.
540534 variants loaded from .bim file.
483 people (232 males, 251 females) loaded from .fam.
--extract: 268345 variants remaining.
Using up to 4 threads (change this with --threads).
Before main variant filters, 483 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.999957.
268345 variants and 483 people pass filters and QC.
Note: No phenotypes present.
Excluding 5221 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
--pca: Results saved to EUR.eigenval and EUR.eigenvec .

find the best-fit PRS

> p.threshold <- c(0.001,0.05,0.1,0.2,0.3,0.4,0.5)
> phenotype <- read.table("EUR.height", header=T)
 <- read.table("EUR.eigenvec", header=F)
colnames(> pcs <- read.table("EUR.eigenvec", header=F)
> colnames(pcs) <- c("FID", "IID", paste0("PC",1:6)) 
> covariate <- read.table("EUR.cov", header=T)
> pheno <- merge(merge(phenotype, covariate, by=c("FID", "IID")), pcs, by=c("FID","IID"))
ight~., data=pheno[,!colnames(pheno)%in%c("FID","IID")]))$r.squared
prs.result <- NULL
for(i in p.th> null.r2 <- summary(lm(Height~., data=pheno[,!colnames(pheno)%in%c("FID","IID")]))$r.squared
reshold){
    pheno.prs <- merge(pheno, 
         > prs.result <- NULL
> for(i in p.threshold){
+     pheno.prs <- merge(pheno, 
+                         read.table(paste0("EUR.",i,".profile"), header=T)[,c("FID","IID", "SCORE")],+                         by=c("FID", "IID"))
+     model <- summary(lm(Height~., data=pheno.prs[,!colnames(pheno.prs)%in%c("FID","IID")]))
+     model.r2 <- model$r.squared
+     prs.r2 <- model.r2-null.r2
+     prs.coef <- model$coeff["SCORE",]
+     prs.result <- rbind(prs.result, 
+         data.frame(Threshold=i, R2=prs.r2, 
+                     P=as.numeric(prs.coef[4]), 
+                     BETA=as.numeric(prs.coef[1]),
+                     SE=as.numeric(prs.coef[2])))
+ }
,])> print(prs.result[which.max(prs.result$R2),])
  Threshold        R2           P     BETA       SE
5       0.3 0.1585864 6.11268e-25 45113.72 4123.033
> prs.result
  Threshold         R2            P     BETA        SE
1     0.001 0.08989158 3.806105e-14  6667.59  853.5029
2     0.050 0.14252679 2.557276e-22 22741.25 2220.8554
3     0.100 0.14979089 1.698300e-23 29392.77 2783.7553
4     0.200 0.15246095 6.220157e-24 37819.97 3542.7541
5     0.300 0.15858638 6.112680e-25 45113.72 4123.0326
6     0.400 0.15750857 9.209118e-25 50228.65 4610.2216
7     0.500 0.15795704 7.765965e-25 54964.53 5035.8968

Published in categories Note