Yan Holtz, Zhihong Zhu, Julanne Frater, Perry Bartlett, Jian Yang, John McGrath




The previous part of this project run a GWAS on vitamin D blood concentration and found out 6 loci significantly associated. This part aims to annotate these regions in order to list putative underlying genes.

library(tidyverse)
library(qqman)
library(readr)
library(DT)
library(RColorBrewer)
library(ggrepel)

Genes in the region


The Locus Zoom online tool allows to visualise the regions of interest and see the underlying genes. Just put the .linear result of PLINK on it, and specify the SNP of interest.

Results for the 6 significant associations:



rs3755967

rs12785878

rs10741657

rs17216707

rs10745742

rs8018720

Genome-Wide Gene-based Association Study (GWGAS)


This step is computed using the MAGMA pipeline that is run through the FUMA online platform. Here is what it does:

Here is the Manhattan plot showing the p-value of each gene. Genome wide significance (red dashed line in the plot) was defined at P = 0.05/17999 = 2.778e-6.



img



Number of significantly associated genes: 22. A list of these 22 genes can be found using the download button of FUMA and reading genes.txt. This list of gene is going to be used for several of the following steps.

Note: Connection to the FUMA online platform: go to the website / Account: y.holtz@uq.edu.au / Pass: fumayanholtz.

Note: This step is computed with the SNP2GENE utility of the FUMA platform.

Where are these 22 genes expressed?


The previous GWGAS step identified 22 genes potentially involved in Vitamin-D concentration. The Gene2Func utility of the FUMA platform allows to check where these genes are expressed the most. It uses the GTEX dataset that provides expression data for all genes in 53 different tissue types:



img

A few observation:

Differentially expressed genes (DEG)


This part follows the previous one. If I understood well here is how it works:



img



A few observation:

Where are the SNPs more expressed


The previous approach considered only the 22 significant genes detected using the GWGAS method. The MAGMA pipeline offers another approach that use the full distribution of SNP p-values to detect tissues enriched for vitamin D concentration. Here is the result:



img

Observation:

Gene-Set analysis


The MsigDB is a database that lists about 10,000 groups of gene involved in the same pathways. For instance, the pathway called go_favonoid_metabolic_process is composed by 24 genes and contols the flavonoid metabolic process.

It is possible to test wether our 22 genes are often involved in some of these pathways. In our case, no significant pathway seems to be enriched. Here are 4 pathways that seem to have a little trend, but do not pass multiple testing correction.

Functional annotation of SNPs


The vitamin-D GWAS currently available show 6 independant loci. These loci are composed by 1875 SNPs. Each SNP has a position. Using the genome annotation, it is possible to count what are the functional annotations of SNPs that are the most frequent. FUMA allows to load a file called snps.txt that present these results. Let’s load it and visualize the data.

# Packages
library(tidyverse)
library(hrbrthemes)
library(patchwork)

# read data
data <- read.table("0_DATA/snps.txt", header=T)

# Plot 1: functionnal annotation:
p1 <- data %>%
  filter(!is.na(func)) %>%
  filter(!func %in% c("upstream:downstream:upstream", "upstream:upstream:downstream" )) %>%
  group_by(func) %>%
  summarize(n=n()) %>%
  arrange(n) %>%
  mutate(func=factor(func, func)) %>%
  ggplot(aes(x=func, y=n)) +
    geom_bar(stat="identity", alpha=0.6) +
    coord_flip() +
    theme_ipsum() +
    xlab("") +
    ylab("") +
    ggtitle("Functionnal Annotation") +
    theme(plot.title = element_text(size=10))

# Plot 2: regulome DB score:
p2 <- data %>%
  filter(!is.na(RDB)) %>%
  group_by(RDB) %>%
  summarize(n=n()) %>%
  arrange(desc(RDB)) %>%
  mutate(RDB=factor(RDB,RDB)) %>%
  ggplot(aes(x=RDB, y=n)) +
    geom_bar(stat="identity", alpha=0.6) +
    coord_flip() +
    theme_ipsum() +
    xlab("") +
    ylab("") +
    ggtitle("RegulomeDB score") +
    theme(plot.title = element_text(size=10))

# Plot 3: chromatin state:
p3 <- data %>%
  group_by(minChrState) %>%
  summarize(n=n()) %>%
  arrange(desc(minChrState)) %>%
  mutate(minChrState = factor(minChrState, minChrState)) %>%
  ggplot(aes(x=as.factor(minChrState), y=n)) +
    geom_bar(stat="identity", alpha=0.6) +
    coord_flip() +
    theme_ipsum()  +
    xlab("") +
    ylab("") +
    ggtitle("Chromatine state") +
    theme(plot.title = element_text(size=10))

p1 + p2 + p3


Explanation:


note - I’m not sure why this result is interesting. Indeed, we would probably need to compare it with the proportion of each class in the complete genome. For instance, intergenic and intronic annotations are probably the most common ones in the genome, so it is not a surprise if I observe it so often, is it?

note - The tool used to assign each SNP to a class is called ANNOVAR.

Other traits controled by the same loci:


The GWAS catalog is a catalog of thounsands of GWAS. FUMA allows to take all our significant SNPs and check if they have already been detected for another study. It appears that:

A note on FUMA


FUMA is a platform that can be used to annotate, prioritize, visualize and interpret GWAS results. It offers 2 main functions:

Note - Connection to the FUMA online platform: go to the website / Account: y.holtz@uq.edu.au / Pass: fumayanholtz.

Citation -

Input - as input it takes a GWAS result. Transfert the Plink results locally:

cd /Users/y.holtz/Desktop
scp y.holtz@delta.imb.uq.edu.au:/home/y.holtz/BLOOD_GWAS/1_GWAS/PLINK/result_GWAS_bloodpressure_plink.linear.gz .

Other: Genes in the region with IGV


Another possiblity suggested by John would be to use the IGV software. Download is possible here. It looks like this tool is more adapted for people working at the molecular level? Like using NGS and reads?

 

A work by Yan Holtz

Yan.holtz.data@gmail.com