Pi User Manual

Hai Fang1*, The ULTRA-DD Consortium2 and Julian C Knight1**

1Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
2https://ultra-dd.org

*hfang@well.ox.ac.uk
**julian@well.ox.ac.uk

2020-08-11

Abstract

Instructions on how to use an R/Bioconductor package "Pi", a genetics-led system for the discovery and prioritisation of drug targets at the gene and pathway level. Taking as inputs GWAS summary statistics for immune-related traits/diseases, Pi uses genomic predictors to identify and score genomic seed genes likely responsible for these genetic associations, including nearby genes nGene based on genomic proximity and genomic organisation; chromatin conformation genes cGene based on summary data produced from promoter capture Hi-C studies; and expression-associated genes eGene based on summary data produced from eQTL mapping. For eGene identification and scoring, Pi supports GWAS-eQTL colocalisation analysis enabling directionality and magnitude of effect integration into the output. Restricted to genomic seed genes, Pi additionally uses annotation predictors related to immune function and dysfunction, including function genes fGene, phenotype gene pGene and disease gene dGene. Pi further identifies non-seed genes under network influence using the random walk with restart algorithm, that is, identification of non-seed genes based on network connectivity/affinity of gene interaction information to seed genes. As such, a gene-predictor matrix that combines genomic and annotation predictors with knowledge of affinity scores is constructed given GWAS summary statistics. The prioritisation of target genes is enabled through meta-analysis (discovery mode; without machine learning) to prioritise target genes with substantial genetic support and/or with network evidence. The prioritisation of target pathways individually and at crosstalk is based on the prioritised target genes.

Package

Pi 2.0.2

1 Citation

Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritise drug targets at the gene and pathway level. Bioconductor (2020); doi:10.18129/B9.bioc.Pi

2 Installation

2.1 R

See https://www.r-project.org or http://bioconductor.org/install/#install-R

2.2 Packages

install.packages("BiocManager",dependencies=T)

# stable release version from Bioconductor
BiocManager::install("Pi", dependencies=T)

# reload the installed package
detach(package:Pi, unload=T)
library(Pi)

3 Overview

Pi pipeline. Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (out of ~15,000 genes) with affinity scores ensuring different predictors comparable.

Figure 1: Pi pipeline
Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (out of ~15,000 genes) with affinity scores ensuring different predictors comparable.

4 Functions

Pi functions. Designed in a nested way following this route: *`xPierSNPsAdv`* or *`xPierSNPsAdvABF`* -> *`xPierSNPs`* -> *`xPierGenes`* -> *`xPier`* -> *`xRWR`* to prepare a gene-predict matrix and prioritisation of targe genes (`xPierMatrix`) taking as inputs GWAS summary statistics/data for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (*`xPierPathways`*) or pathway crosstalk identification (*`xPierSubnet`*). Notably, the function `xPierSNPsAdv` is supposed to analyse GWAS summary data with disease-associated SNPs and the significance level provided only, while the function `xPierSNPsAdvABF` requires additional info (effect/other alleles, effect size, standard error, etc) from GWAS summary data enabling the eGene identification and effect estimation through GWAS-eQTL colocalisation analysis (specifically, coloc - Approximate Bayes Factor).

Figure 2: Pi functions
Designed in a nested way following this route: xPierSNPsAdv or xPierSNPsAdvABF -> xPierSNPs -> xPierGenes -> xPier -> xRWR to prepare a gene-predict matrix and prioritisation of targe genes (xPierMatrix) taking as inputs GWAS summary statistics/data for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (xPierPathways) or pathway crosstalk identification (xPierSubnet). Notably, the function xPierSNPsAdv is supposed to analyse GWAS summary data with disease-associated SNPs and the significance level provided only, while the function xPierSNPsAdvABF requires additional info (effect/other alleles, effect size, standard error, etc) from GWAS summary data enabling the eGene identification and effect estimation through GWAS-eQTL colocalisation analysis (specifically, coloc - Approximate Bayes Factor).

5 Getting Started

Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi to achieve disease-specific prioritisation of drug targets at the gene and pathway level

First of all, load the package and specify the location of built-in data.

library(Pi)
RData.location <- "http://galahad.well.ox.ac.uk/bigdata"

5.1 Input data

GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.

## make sure the package 'readr' installed
## BiocManager::install("readr")
data.file <- file.path(RData.location, "GWAS_summary_statistics.RA.txt.gz")
data <- readr::read_delim(data.file, delim="\t") %>% as.data.frame()

Table 1: The first 5 rows of `data`
The column `snp` for GWAS SNPs, `p` for GWAS-reported p-value, `effect` for the allele with effect assessed, `other` the other allele (major allele), `b` for the effect size, `se` for the standard error, and `suggestive` for the logic indicating SNPs (FALSE) only for nGene/cGene identification and scoring; however all SNPs (no matter TRUE or FALSE) used to define/score eGene.
snp	p	effect	other	b	se	suggestive
chr1:100007258	0.031	T	C	-0.1743534	0.08082870	TRUE
chr1:10000804	0.014	T	C	-0.5798185	0.23596107	TRUE
chr1:100008943	0.016	T	G	-0.2231436	0.09263237	TRUE
chr1:100016466	0.097	A	G	-0.1165338	0.07021908	TRUE
chr1:100029932	0.044	A	G	-0.2231436	0.11079121	TRUE

5.2 Target gene prioritisation

Parameters used

include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
include.eQTL <- c("Bcell","Blood","CD14","CD4","CD8","IFN","LPS24","LPS2","Neutrophil","NK")
eQTL.customised <- NULL
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location)  # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE

Table 2: Promoter capture Hi-C datasets used to identify cGenes.
Code	Name
Monocytes	monocytes
Macrophages_M0	macrophages (M0)
Macrophages_M1	macrophages (M1)
Macrophages_M2	macrophages (M2)
Neutrophils	neutrophils
Naive_CD4_T_cells	CD4+ T cells (naïve)
Total_CD4_T_cells	CD4+ T cells (total)
Naive_CD8_T_cells	CD8+ T cells (naïve)
Total_CD8_T_cells	CD8+ T cells (total)
Naive_B_cells	B cells (naïve)
Total_B_cells	B cells (total)

Table 3: The eQTL mapping datasets used to identify eGenes.
Code	Name
CD14	CD14+ (monocytes)
LPS2	CD14+ by LPS2h
LPS24	CD14+ by LPS24h
IFN	CD14+ by IFN24h
Bcell	B cells
CD4	CD4+ T cells
CD8	CD8+ T cells
Neutrophil	neutrophils
NK	NK cells
Blood	peripheral blood

Prioritisation

# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdvABF(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)

## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iA <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)[,c("Symbol","OMIM","Phenotype","Function")]
colnames(iA) <- c("Symbol","dGene","pGene","fGene")
ls_pNode_anno <- lapply(2:4, function(j){
    data_anno <- subset(data.frame(seed=iA$Symbol,weight=iA[,j],stringsAsFactors=F), weight>0)
    pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- colnames(iA)[2:4]
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)

# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", aggregateBy="fishers", RData.location=RData.location)

The results are stored in an object called dTarget, a list with components including a data frame priority.

dTarget
## An object of S3 class 'dTarget', with 4 components:
##   $metag: an igraph object with 15432 nodes and 313918 edges
##   $predictor: a data frame of 15432 rows X 25 columns
##   $priority: a data frame of 15432 rows X 11 columns
##   $list_pNode: a list of 25 'pNode' objects
## 
## --------------------------------------------------
## $priority:
##      name rank  rating                                            description
##  HLA-DQA1    1 5.00000 major histocompatibility complex, class II, DQ alpha 1
##  HLA-DQB1    2 4.99281  major histocompatibility complex, class II, DQ beta 1
##  seed nGene cGene eGene dGene pGene fGene
##     Y     1     9     0     1     1     1
##     Y     1    11     1     1     1     1
## ......

The data frame dTarget$priority can be written into a text file Pi_output_priority.txt:

write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)

Manhattan plot

gp_manhattan <- xPierManhattan(dTarget, color='rainbow_hcl', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan

$Manhattan plot. Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named.$

Figure 3: Manhattan plot
Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named.

5.3 Target pathway prioritisation

Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of immune and signal transduction pathways (sourced from REACTOME).

priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(15,1500), test="fisher", min.overlap=10, RData.location=RData.location)
df_eTerm <- subset(xEnrichViewer(eTerm,'all',details=T), namespace %in% c('Immune System','Signal Transduction'))
gp_reactome <- xEnrichForest(df_eTerm, zlim=c(0,30))
gp_reactome

Figure 4: Forest plot of prioritised REACTOME pathways

5.4 Target pathway crosstalk

An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritised genes with a few less prioritised genes as linkers.

# obtain a gene network merged from KEGG pathways
ls_ig <- lapply(c("KEGG_environmental","KEGG_organismal"), function(network){
    g <- xDefineNet(network=network, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig)

# search for a subset (with a desired node number [30,35]) of the gene network 
g <- xPierSubnet(dTarget, network.customised=ig, subnet.size=33, RData.location=RData.location)

# visualise pathway crosstalk
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence

Visualisation of pathway crosstalk. Nodes/genes colored by Pi rating and labelled with evidence (the types of seed genes; colored segments).

Figure 5: Visualisation of pathway crosstalk
Nodes/genes colored by Pi rating and labelled with evidence (the types of seed genes; colored segments).