Instructions on how to use an R/Bioconductor package "Pi", a genetics-led system for the discovery and prioritisation of drug targets at the gene and pathway level. Taking as inputs GWAS summary statistics for immune-related traits/diseases, Pi uses genomic predictors to identify and score genomic seed genes likely responsible for these genetic associations, including nearby genes nGene based on genomic proximity and genomic organisation; chromatin conformation genes cGene based on summary data produced from promoter capture Hi-C studies; and expression-associated genes eGene based on summary data produced from eQTL mapping. For eGene identification and scoring, Pi supports GWAS-eQTL colocalisation analysis enabling directionality and magnitude of effect integration into the output. Restricted to genomic seed genes, Pi additionally uses annotation predictors related to immune function and dysfunction, including function genes fGene, phenotype gene pGene and disease gene dGene. Pi further identifies non-seed genes under network influence using the random walk with restart algorithm, that is, identification of non-seed genes based on network connectivity/affinity of gene interaction information to seed genes. As such, a gene-predictor matrix that combines genomic and annotation predictors with knowledge of affinity scores is constructed given GWAS summary statistics. The prioritisation of target genes is enabled through meta-analysis (discovery mode; without machine learning) to prioritise target genes with substantial genetic support and/or with network evidence. The prioritisation of target pathways individually and at crosstalk is based on the prioritised target genes.
Pi 2.0.2
Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritise drug targets at the gene and pathway level. Bioconductor (2020); doi:10.18129/B9.bioc.Pi
install.packages("BiocManager",dependencies=T)
# stable release version from Bioconductor
BiocManager::install("Pi", dependencies=T)
# reload the installed package
detach(package:Pi, unload=T)
library(Pi)
Figure 1: Pi pipeline
Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (out of ~15,000 genes) with affinity scores ensuring different predictors comparable.
Figure 2: Pi functions
Designed in a nested way following this route: xPierSNPsAdv or xPierSNPsAdvABF -> xPierSNPs -> xPierGenes -> xPier -> xRWR to prepare a gene-predict matrix and prioritisation of targe genes (xPierMatrix) taking as inputs GWAS summary statistics/data for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (xPierPathways) or pathway crosstalk identification (xPierSubnet). Notably, the function xPierSNPsAdv is supposed to analyse GWAS summary data with disease-associated SNPs and the significance level provided only, while the function xPierSNPsAdvABF requires additional info (effect/other alleles, effect size, standard error, etc) from GWAS summary data enabling the eGene identification and effect estimation through GWAS-eQTL colocalisation analysis (specifically, coloc - Approximate Bayes Factor).
Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi to achieve disease-specific prioritisation of drug targets at the gene and pathway level
First of all, load the package and specify the location of built-in data.
library(Pi)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"
GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.
## make sure the package 'readr' installed
## BiocManager::install("readr")
data.file <- file.path(RData.location, "GWAS_summary_statistics.RA.txt.gz")
data <- readr::read_delim(data.file, delim="\t") %>% as.data.frame()
| snp | p | effect | other | b | se | suggestive |
|---|---|---|---|---|---|---|
| chr1:100007258 | 0.031 | T | C | -0.1743534 | 0.08082870 | TRUE |
| chr1:10000804 | 0.014 | T | C | -0.5798185 | 0.23596107 | TRUE |
| chr1:100008943 | 0.016 | T | G | -0.2231436 | 0.09263237 | TRUE |
| chr1:100016466 | 0.097 | A | G | -0.1165338 | 0.07021908 | TRUE |
| chr1:100029932 | 0.044 | A | G | -0.2231436 | 0.11079121 | TRUE |
Parameters used
include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
include.eQTL <- c("Bcell","Blood","CD14","CD4","CD8","IFN","LPS24","LPS2","Neutrophil","NK")
eQTL.customised <- NULL
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location) # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE
| Code | Name |
|---|---|
| Monocytes | monocytes |
| Macrophages_M0 | macrophages (M0) |
| Macrophages_M1 | macrophages (M1) |
| Macrophages_M2 | macrophages (M2) |
| Neutrophils | neutrophils |
| Naive_CD4_T_cells | CD4+ T cells (naïve) |
| Total_CD4_T_cells | CD4+ T cells (total) |
| Naive_CD8_T_cells | CD8+ T cells (naïve) |
| Total_CD8_T_cells | CD8+ T cells (total) |
| Naive_B_cells | B cells (naïve) |
| Total_B_cells | B cells (total) |
| Code | Name |
|---|---|
| CD14 | CD14+ (monocytes) |
| LPS2 | CD14+ by LPS2h |
| LPS24 | CD14+ by LPS24h |
| IFN | CD14+ by IFN24h |
| Bcell | B cells |
| CD4 | CD4+ T cells |
| CD8 | CD8+ T cells |
| Neutrophil | neutrophils |
| NK | NK cells |
| Blood | peripheral blood |
Prioritisation
# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdvABF(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iA <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)[,c("Symbol","OMIM","Phenotype","Function")]
colnames(iA) <- c("Symbol","dGene","pGene","fGene")
ls_pNode_anno <- lapply(2:4, function(j){
data_anno <- subset(data.frame(seed=iA$Symbol,weight=iA[,j],stringsAsFactors=F), weight>0)
pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- colnames(iA)[2:4]
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)
# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", aggregateBy="fishers", RData.location=RData.location)
The results are stored in an object called dTarget, a list with components including a data frame priority.
dTarget
## An object of S3 class 'dTarget', with 4 components:
## $metag: an igraph object with 15432 nodes and 313918 edges
## $predictor: a data frame of 15432 rows X 25 columns
## $priority: a data frame of 15432 rows X 11 columns
## $list_pNode: a list of 25 'pNode' objects
##
## --------------------------------------------------
## $priority:
## name rank rating description
## HLA-DQA1 1 5.00000 major histocompatibility complex, class II, DQ alpha 1
## HLA-DQB1 2 4.99281 major histocompatibility complex, class II, DQ beta 1
## seed nGene cGene eGene dGene pGene fGene
## Y 1 9 0 1 1 1
## Y 1 11 1 1 1 1
## ......
The data frame dTarget$priority can be written into a text file Pi_output_priority.txt:
write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)
Manhattan plot
gp_manhattan <- xPierManhattan(dTarget, color='rainbow_hcl', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan
Figure 3: Manhattan plot
Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named.
Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of immune and signal transduction pathways (sourced from REACTOME).
priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(15,1500), test="fisher", min.overlap=10, RData.location=RData.location)
df_eTerm <- subset(xEnrichViewer(eTerm,'all',details=T), namespace %in% c('Immune System','Signal Transduction'))
gp_reactome <- xEnrichForest(df_eTerm, zlim=c(0,30))
gp_reactome
Figure 4: Forest plot of prioritised REACTOME pathways
An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritised genes with a few less prioritised genes as linkers.
# obtain a gene network merged from KEGG pathways
ls_ig <- lapply(c("KEGG_environmental","KEGG_organismal"), function(network){
g <- xDefineNet(network=network, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig)
# search for a subset (with a desired node number [30,35]) of the gene network
g <- xPierSubnet(dTarget, network.customised=ig, subnet.size=33, RData.location=RData.location)
# visualise pathway crosstalk
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence
Figure 5: Visualisation of pathway crosstalk
Nodes/genes colored by Pi rating and labelled with evidence (the types of seed genes; colored segments).