1 Citation

Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritise drug targets at the gene and pathway level. Bioconductor (2020); doi:10.18129/B9.bioc.Pi

2 Installation

2.2 Packages

install.packages("BiocManager",dependencies=T)

# stable release version from Bioconductor
BiocManager::install("Pi", dependencies=T)

# reload the installed package
detach(package:Pi, unload=T)
library(Pi)

3 Overview

Pi pipeline. Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (out of ~15,000 genes) with affinity scores ensuring different predictors comparable.

Figure 1: Pi pipeline
Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (out of ~15,000 genes) with affinity scores ensuring different predictors comparable.

4 Functions

Pi functions. Designed in a nested way following this route: *`xPierSNPsAdv`* or *`xPierSNPsAdvABF`* -> *`xPierSNPs`* -> *`xPierGenes`* -> *`xPier`* -> *`xRWR`* to prepare a gene-predict matrix and prioritisation of targe genes (`xPierMatrix`) taking as inputs GWAS summary statistics/data for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (*`xPierPathways`*) or pathway crosstalk identification (*`xPierSubnet`*). Notably, the function `xPierSNPsAdv` is supposed to analyse GWAS summary data with disease-associated SNPs and the significance level provided only, while the function `xPierSNPsAdvABF` requires additional info (effect/other alleles, effect size, standard error, etc) from GWAS summary data enabling the eGene identification and effect estimation through GWAS-eQTL colocalisation analysis (specifically, coloc - Approximate Bayes Factor).

Figure 2: Pi functions
Designed in a nested way following this route: xPierSNPsAdv or xPierSNPsAdvABF -> xPierSNPs -> xPierGenes -> xPier -> xRWR to prepare a gene-predict matrix and prioritisation of targe genes (xPierMatrix) taking as inputs GWAS summary statistics/data for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (xPierPathways) or pathway crosstalk identification (xPierSubnet). Notably, the function xPierSNPsAdv is supposed to analyse GWAS summary data with disease-associated SNPs and the significance level provided only, while the function xPierSNPsAdvABF requires additional info (effect/other alleles, effect size, standard error, etc) from GWAS summary data enabling the eGene identification and effect estimation through GWAS-eQTL colocalisation analysis (specifically, coloc - Approximate Bayes Factor).

5 Getting Started

Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi to achieve disease-specific prioritisation of drug targets at the gene and pathway level

First of all, load the package and specify the location of built-in data.

library(Pi)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"

5.1 Input data

GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.

## make sure the package 'readr' installed
## BiocManager::install("readr")
data.file <- file.path(RData.location, "GWAS_summary_statistics.RA.txt.gz")
data <- readr::read_delim(data.file, delim="\t") %>% as.data.frame()
Table 1: The first 5 rows of data
The column snp for GWAS SNPs, p for GWAS-reported p-value, effect for the allele with effect assessed, other the other allele (major allele), b for the effect size, se for the standard error, and suggestive for the logic indicating SNPs (FALSE) only for nGene/cGene identification and scoring; however all SNPs (no matter TRUE or FALSE) used to define/score eGene.
snp p effect other b se suggestive
chr1:100007258 0.031 T C -0.1743534 0.08082870 TRUE
chr1:10000804 0.014 T C -0.5798185 0.23596107 TRUE
chr1:100008943 0.016 T G -0.2231436 0.09263237 TRUE
chr1:100016466 0.097 A G -0.1165338 0.07021908 TRUE
chr1:100029932 0.044 A G -0.2231436 0.11079121 TRUE

5.2 Target gene prioritisation

Parameters used

include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
include.eQTL <- c("Bcell","Blood","CD14","CD4","CD8","IFN","LPS24","LPS2","Neutrophil","NK")
eQTL.customised <- NULL
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location)  # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE

Table 2: Promoter capture Hi-C datasets used to identify cGenes.
Code Name
Monocytes monocytes
Macrophages_M0 macrophages (M0)
Macrophages_M1 macrophages (M1)
Macrophages_M2 macrophages (M2)
Neutrophils neutrophils
Naive_CD4_T_cells CD4+ T cells (naïve)
Total_CD4_T_cells CD4+ T cells (total)
Naive_CD8_T_cells CD8+ T cells (naïve)
Total_CD8_T_cells CD8+ T cells (total)
Naive_B_cells B cells (naïve)
Total_B_cells B cells (total)

Table 3: The eQTL mapping datasets used to identify eGenes.
Code Name
CD14 CD14+ (monocytes)
LPS2 CD14+ by LPS2h
LPS24 CD14+ by LPS24h
IFN CD14+ by IFN24h
Bcell B cells
CD4 CD4+ T cells
CD8 CD8+ T cells
Neutrophil neutrophils
NK NK cells
Blood peripheral blood

Prioritisation

# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdvABF(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)

## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iA <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)[,c("Symbol","OMIM","Phenotype","Function")]
colnames(iA) <- c("Symbol","dGene","pGene","fGene")
ls_pNode_anno <- lapply(2:4, function(j){
    data_anno <- subset(data.frame(seed=iA$Symbol,weight=iA[,j],stringsAsFactors=F), weight>0)
    pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- colnames(iA)[2:4]
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)

# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", aggregateBy="fishers", RData.location=RData.location)

The results are stored in an object called dTarget, a list with components including a data frame priority.

dTarget
## An object of S3 class 'dTarget', with 4 components:
##   $metag: an igraph object with 15432 nodes and 313918 edges
##   $predictor: a data frame of 15432 rows X 25 columns
##   $priority: a data frame of 15432 rows X 11 columns
##   $list_pNode: a list of 25 'pNode' objects
## 
## --------------------------------------------------
## $priority:
##      name rank  rating                                            description
##  HLA-DQA1    1 5.00000 major histocompatibility complex, class II, DQ alpha 1
##  HLA-DQB1    2 4.99281  major histocompatibility complex, class II, DQ beta 1
##  seed nGene cGene eGene dGene pGene fGene
##     Y     1     9     0     1     1     1
##     Y     1    11     1     1     1     1
## ......

The data frame dTarget$priority can be written into a text file Pi_output_priority.txt:

write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)

Manhattan plot

gp_manhattan <- xPierManhattan(dTarget, color='rainbow_hcl', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan
Manhattan plot. Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named.

Figure 3: Manhattan plot
Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named.

5.3 Target pathway prioritisation

Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of immune and signal transduction pathways (sourced from REACTOME).

priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(15,1500), test="fisher", min.overlap=10, RData.location=RData.location)
df_eTerm <- subset(xEnrichViewer(eTerm,'all',details=T), namespace %in% c('Immune System','Signal Transduction'))
gp_reactome <- xEnrichForest(df_eTerm, zlim=c(0,30))
gp_reactome
Forest plot of prioritised REACTOME pathways.

Figure 4: Forest plot of prioritised REACTOME pathways

5.4 Target pathway crosstalk

An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritised genes with a few less prioritised genes as linkers.

# obtain a gene network merged from KEGG pathways
ls_ig <- lapply(c("KEGG_environmental","KEGG_organismal"), function(network){
    g <- xDefineNet(network=network, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig)

# search for a subset (with a desired node number [30,35]) of the gene network 
g <- xPierSubnet(dTarget, network.customised=ig, subnet.size=33, RData.location=RData.location)

# visualise pathway crosstalk
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence
Visualisation of pathway crosstalk. Nodes/genes colored by Pi rating and labelled with evidence (the types of seed genes; colored segments).

Figure 5: Visualisation of pathway crosstalk
Nodes/genes colored by Pi rating and labelled with evidence (the types of seed genes; colored segments).