Instructions on how to use an R/Bioconductor package "Pi", a genetics-led system for the discovery and prioritisation of drug targets at the gene and pathway level. Taking as inputs GWAS summary statistics for immune-related traits/diseases, Pi uses genomic predictors
to identify and score genomic seed genes
likely responsible for these genetic associations, including nearby genes nGene
based on genomic proximity and genomic organisation; chromatin conformation genes cGene
based on summary data produced from promoter capture Hi-C studies; and expression-associated genes eGene
based on summary data produced from eQTL mapping. For eGene identification and scoring, Pi supports GWAS-eQTL colocalisation analysis enabling directionality and magnitude of effect integration into the output. Restricted to genomic seed genes
, Pi additionally uses annotation predictors
related to immune function and dysfunction, including function genes fGene
, phenotype gene pGene
and disease gene dGene
. Pi further identifies non-seed genes
under network influence using the random walk with restart algorithm, that is, identification of non-seed genes based on network connectivity/affinity of gene interaction information to seed genes. As such, a gene-predictor matrix that combines genomic and annotation predictors with knowledge of affinity scores is constructed given GWAS summary statistics. The prioritisation of target genes is enabled through meta-analysis (discovery mode; without machine learning) to prioritise target genes with substantial genetic support and/or with network evidence. The prioritisation of target pathways individually and at crosstalk is based on the prioritised target genes.
Pi 2.0.2
Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritise drug targets at the gene and pathway level. Bioconductor (2020); doi:10.18129/B9.bioc.Pi
install.packages("BiocManager",dependencies=T)
# stable release version from Bioconductor
BiocManager::install("Pi", dependencies=T)
# reload the installed package
detach(package:Pi, unload=T)
library(Pi)
Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi
to achieve disease-specific prioritisation of drug targets at the gene and pathway level
First of all, load the package and specify the location of built-in data.
library(Pi)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"
GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.
## make sure the package 'readr' installed
## BiocManager::install("readr")
data.file <- file.path(RData.location, "GWAS_summary_statistics.RA.txt.gz")
data <- readr::read_delim(data.file, delim="\t") %>% as.data.frame()
snp | p | effect | other | b | se | suggestive |
---|---|---|---|---|---|---|
chr1:100007258 | 0.031 | T | C | -0.1743534 | 0.08082870 | TRUE |
chr1:10000804 | 0.014 | T | C | -0.5798185 | 0.23596107 | TRUE |
chr1:100008943 | 0.016 | T | G | -0.2231436 | 0.09263237 | TRUE |
chr1:100016466 | 0.097 | A | G | -0.1165338 | 0.07021908 | TRUE |
chr1:100029932 | 0.044 | A | G | -0.2231436 | 0.11079121 | TRUE |
Parameters used
include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
include.eQTL <- c("Bcell","Blood","CD14","CD4","CD8","IFN","LPS24","LPS2","Neutrophil","NK")
eQTL.customised <- NULL
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location) # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE
Code | Name |
---|---|
Monocytes | monocytes |
Macrophages_M0 | macrophages (M0) |
Macrophages_M1 | macrophages (M1) |
Macrophages_M2 | macrophages (M2) |
Neutrophils | neutrophils |
Naive_CD4_T_cells | CD4+ T cells (naïve) |
Total_CD4_T_cells | CD4+ T cells (total) |
Naive_CD8_T_cells | CD8+ T cells (naïve) |
Total_CD8_T_cells | CD8+ T cells (total) |
Naive_B_cells | B cells (naïve) |
Total_B_cells | B cells (total) |
Code | Name |
---|---|
CD14 | CD14+ (monocytes) |
LPS2 | CD14+ by LPS2h |
LPS24 | CD14+ by LPS24h |
IFN | CD14+ by IFN24h |
Bcell | B cells |
CD4 | CD4+ T cells |
CD8 | CD8+ T cells |
Neutrophil | neutrophils |
NK | NK cells |
Blood | peripheral blood |
Prioritisation
# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdvABF(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iA <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)[,c("Symbol","OMIM","Phenotype","Function")]
colnames(iA) <- c("Symbol","dGene","pGene","fGene")
ls_pNode_anno <- lapply(2:4, function(j){
data_anno <- subset(data.frame(seed=iA$Symbol,weight=iA[,j],stringsAsFactors=F), weight>0)
pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- colnames(iA)[2:4]
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)
# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", aggregateBy="fishers", RData.location=RData.location)
The results are stored in an object called dTarget
, a list with components including a data frame priority
.
dTarget
## An object of S3 class 'dTarget', with 4 components:
## $metag: an igraph object with 15432 nodes and 313918 edges
## $predictor: a data frame of 15432 rows X 25 columns
## $priority: a data frame of 15432 rows X 11 columns
## $list_pNode: a list of 25 'pNode' objects
##
## --------------------------------------------------
## $priority:
## name rank rating description
## HLA-DQA1 1 5.00000 major histocompatibility complex, class II, DQ alpha 1
## HLA-DQB1 2 4.99281 major histocompatibility complex, class II, DQ beta 1
## seed nGene cGene eGene dGene pGene fGene
## Y 1 9 0 1 1 1
## Y 1 11 1 1 1 1
## ......
The data frame dTarget$priority
can be written into a text file Pi_output_priority.txt
:
write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)
Manhattan plot
gp_manhattan <- xPierManhattan(dTarget, color='rainbow_hcl', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan
Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of immune and signal transduction pathways (sourced from REACTOME).
priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(15,1500), test="fisher", min.overlap=10, RData.location=RData.location)
df_eTerm <- subset(xEnrichViewer(eTerm,'all',details=T), namespace %in% c('Immune System','Signal Transduction'))
gp_reactome <- xEnrichForest(df_eTerm, zlim=c(0,30))
gp_reactome
An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritised genes with a few less prioritised genes as linkers.
# obtain a gene network merged from KEGG pathways
ls_ig <- lapply(c("KEGG_environmental","KEGG_organismal"), function(network){
g <- xDefineNet(network=network, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig)
# search for a subset (with a desired node number [30,35]) of the gene network
g <- xPierSubnet(dTarget, network.customised=ig, subnet.size=33, RData.location=RData.location)
# visualise pathway crosstalk
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence