GREGOR enrichment analysis#
Description#
Aim. Evaluate the global enrichment of a set of trait-associated (positive) variants within experimentally annotated epigenomic regulatory features, relative to a matched negative set, using GREGOR (Genomic Regulatory Elements and Gwas Overlap algoRithm).
Methods overview#
GREGOR performs enrichment analysis for given positive set of variants against matched negative set, matched by:
MAF
LD proxy
distance to TSS
Gene density
Methods to obtain the column “PValue” from GREGOR software can be found in the GREGOR paper: https://pubmed.ncbi.nlm.nih.gov/25886982/. We have noticed some PValue are greater than 1. Also, p-value itself is not as informative compared to odds ratios of enrichment. We therefore parse intermediate outputs from GREGOR to get a 2 by 2 table where the rows are within annotation peak vs outside; the columns are positive set vs matched negative set. We perform Fisher’s exact test to get both p-value and odds ratios.
Input#
--gregor_db: a GREGOR reference database.--index_snp_file: an index SNP file of the positive variant set.--bed_file_index: a bed-file index listing the annotationbedfiles.
Note.
gregor_conf(Step 1) only writes a configuration file from the supplied paths, so it runs end-to-end without the external GREGOR software - it is verified here against synthetic toy inputs underinput/protocol_example.*. Step 2 (gregor), which invokes the GREGORperltool and parses its output, requires the external GREGOR software distributed under the University of Michigan license plus its reference database; that software is not bundled here and is not installed on this system, so Step 2 cannot be executed here. Thegregor_fisher_plotworkflow (Step 3) is self-contained R and only needs two enrichment-result tables.
Output#
All files are written under --cwd, prefixed by the run name:
{name}.gregor.conf— the GREGOR configuration file generated from the supplied paths (Step 1).{name}_gregor_output/StatisticSummaryFile.txt— GREGOR’s raw overlap statistics for the annotation set (Step 2).{name}_variant_counts.txtand{name}_enrichment_results.txt— the parsed 2×2 counts (annotation peak vs outside, positive vs matched-negative variant set) and the per-annotation Fisher exact-test p-values and odds ratios (Step 3).{name}_vs_{ref}_enrichment.pdf— an odds-ratio comparison plot across annotations.
Steps#
Step 1. Build the GREGOR configuration file from the index SNP file and bed-file index (gregor_conf). This step only writes a .gregor.conf file from the supplied paths/parameters and does not invoke the external GREGOR software, so it runs end-to-end with the toy inputs below.
Timing: Requires external GREGOR tool on typical compute infrastructure.
sos run pipeline/gregor.ipynb gregor_conf \
--gregor_db input/enrichment/protocol_example.gregor_ref \
--index_snp_file input/ld/protocol_example.index.snps.txt \
--bed_file_index input/enrichment/protocol_example.bed.file.index \
--pop EUR \
--cwd output/gregor
Step 2. Run the full GREGOR enrichment pipeline (config -> run GREGOR -> Fisher test) (gregor). Requires the GREGOR perl software and reference database.
sos run pipeline/gregor.ipynb gregor \
--gregor_db input/enrichment/protocol_example.gregor_ref \
--index_snp_file input/ld/protocol_example.index.snps.txt \
--bed_file_index input/enrichment/protocol_example.bed.file.index \
--pop EUR \
--cwd output/gregor
Step 3. Plot an odds-ratio comparison from two enrichment-result tables (gregor_fisher_plot). Self-contained R; runnable from *_enrichment_results.txt tables alone.
sos run pipeline/gregor.ipynb gregor_fisher_plot \
--fisher1 input/enrichment/protocol_example.trait1_enrichment_results.txt \
--fisher2 input/enrichment/protocol_example.trait2_enrichment_results.txt \
--cwd output/gregor
Anticipated Results#
Produces a *_enrichment_results.txt table of per-annotation Fisher p-values and odds ratios, and (via gregor_fisher_plot) an odds-ratio comparison PDF. Annotations enriched in the positive variant set show odds ratios above 1.
Command interface#
sos run pipeline/gregor.ipynb -h
Workflow implementation#
[global]
# working directory
parameter: cwd = path("output")
# Software container option
parameter: container = ""
# make configuration file for GREGOR
[gregor_conf, gregor_1]
parameter: gregor_db = path
parameter: pop = 'EUR'
parameter: index_snp_file = path
parameter: bed_file_index = path
parameter: r2_threshold = 0.7
parameter: ld_window_size = 10000
parameter: min_neighbor = 10
parameter: job_number = 10
input: index_snp_file, bed_file_index
output: f'{cwd:a}/{_input[0]:bnn}.gregor.conf'
report: output = f'{_output}', expand = True
##############################################################################
# CHIPSEQ ENRICHMENT CONFIGURATION FILE
# This configuration file contains run-time configuration of
# CHIP_SEQ ENRICHMENT
###############################################################################
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
###############################################################################
INDEX_SNP_FILE = {_input[0]}
BED_FILE_INDEX = {_input[1]}
REF_DIR = {gregor_db}
R2THRESHOLD = {r2_threshold} ## must be greater than 0.7
LDWINDOWSIZE = {ld_window_size} ## must be less than this window; these two values define LD buddies
OUT_DIR = {_output:nn}_gregor_output
MIN_NEIGHBOR_NUM = {min_neighbor} ## define the minimum size of neighborhood
BEDFILE_IS_SORTED = true ## false, if the bed files are not sorted
POPULATION = {pop} ## define the population, you can specify EUR, AFR, AMR or ASN
TOPNBEDFILES = 2
JOBNUMBER = {job_number}
###############################################################################
#BATCHTYPE = mosix ## submit jobs on MOSIX
#BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
###############################################################################
#BATCHTYPE = slurm ## submit jobs on SLURM
#BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
###############################################################################
BATCHTYPE = local ## run jobs on local machine
bash: expand = True, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
sed -i '/^$/d' {_output}
GREGOR is written in perl. If you don’t use containers, some libraries are required before one can run GREGOR:
sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
With our docker image:
cd GREGOR_folder
docker run -v "$PWD:/usr/src/myapp" -it custom-perl
perl script/GREGOR.pl --conf example/mvsusie_annotation.conf
perl script/GREGOR.pl --conf example/susie_annotation.conf
# run GREGOR
[gregor_2]
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True, container = container, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
perl /opt/GREGOR/GREGOR/script/GREGOR.pl --conf {_input} && touch {_output}
# Fisher test of enrichment
[gregor_3]
output: f'{_input:ad}_variant_counts.txt', f'{_input:ad}_enrichment_results.txt'
bash: expand = '$[ ]', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
cd $[_input:ad]/ # The issue was that,{_input:ad}_variant_counts.txt is not in the same dir as the input
# Loop through each subdirectory
for dir in */; do
# Ensure that the directory is not empty
if [[ -d "$dir" ]]; then
# Calculate the total number of lines in neighbor.*.txt files
total_lines=$(find "$dir" -name 'neighbor.*.txt' -exec cat {} + | wc -l)
# Count the number of neighbor.*.txt files
file_count=$(find "$dir" -name 'neighbor.*.txt' | wc -l)
# Calculate the adjusted row number
row_num=$((total_lines - file_count))
# Check if PValue.txt exists
if [[ -f "${dir}/PValue.txt" ]]; then
# Append the row count to PValue.txt and save as PValue_new.txt
(cat "${dir}/PValue.txt"; echo "N1_N2 = $row_num") > "${dir}/PValue_N1_N2.txt"
fi
fi
done
# Header for the new summary file
echo -e "Bed_File\tInBed_Index_SNP\tExpectNum_of_InBed_SNP\tPValue\tN1_N2\tNp\tNp_Nn" > $[_output[0]]
# Calculate Np and Np_Nn for the neighbor_SNP directory
if [[ -d "neighbor_SNP" ]]; then
if [[ -f "neighbor_SNP/index.snp.neighbors.txt" ]]; then
Np=$(($(wc -l < "index_SNP/annotated.index.snp.txt") - 1))
else
Np=0
fi
Np_Nn_files=($(find "neighbor_SNP" -name 'neighbor.chr*txt'))
Np_Nn=0
for file in "${Np_Nn_files[@]}"; do
Np_Nn=$(($Np_Nn + $(wc -l < "$file")))
done
Np_Nn=$(($Np_Nn - ${#Np_Nn_files[@]}))
fi
# Loop through each subdirectory
for dir in */; do
# Ensure that the directory is not empty
if [[ -d "$dir" ]]; then
# Check if PValue_new.txt exists
if [[ -f "${dir}PValue_N1_N2.txt" ]]; then
# Extract values from PValue_new.txt
inBedIndexSNPNum=$(grep "inBedIndexSNPNum" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
expectedS=$(grep "expectedS" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
p3=$(grep "p3" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
N1_N2=$(grep "N1_N2" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
# Get the directory name as the Bed_File
bedFile=$(basename "$dir")
# Append the data to the summary file
echo -e "$bedFile\t$inBedIndexSNPNum\t$expectedS\t$p3\t$N1_N2\t$Np\t$Np_Nn" >> $[_output[0]]
fi
fi
done
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
res <- read.table(${_output[0]:r}, sep ='\t', header=T)
for(i in 1:nrow(res)){
n1 = res$InBed_Index_SNP[i]
n2 = res$N1_N2[i] - n1
np = res$Np[i]
nn = res$Np_Nn[i] - np
# Construct the contingency matrix
dat = matrix(c(n1, n2, np - n1, nn - n2), nrow = 2)
# Perform Fisher's exact test
test_res = fisher.test(dat, alternative = 'two.sided')
# Store the results in 'res'
res$odds[i] <- test_res$estimate
res$low[i] <- test_res$conf.int[1]
res$high[i] <- test_res$conf.int[2]
res$p_fisher[i] <- test_res$p.value
}
res$odds <- as.numeric(res$odds)
write.table(res, ${_output[1]:r}, quote = FALSE, row.names = FALSE)
# enrichment plot with fisher test results
[gregor_fisher_plot]
parameter: fisher1 = path
parameter: fisher2 = path
input: fisher1, fisher2
output: f'{cwd:a}/{_input[0]:dbn}_vs_{_input[1]:dbn}_enrichment.pdf'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
library(tidyverse)
library(ggplot2)
group1 <- read.table(${_input[0]:r}, header=T)
group2 <- read.table(${_input[1]:r}, header=T)
group1$group <- ${_input[0]:dbnr}
group2$group <- ${_input[1]:dbnr}
res_all <- rbind(group1, group2)
res_all$feature <- gsub(".bed","",res_all$Bed_File)
res_all <- res_all %>%filter(!(str_detect(feature,"hg38")))%>%filter(!(p_fisher == 1))%>% na.omit
p <- res_all%>%
arrange(odds)%>%ggplot()+geom_point(aes(x = odds, y = reorder(feature,-odds), color = group))+
geom_vline(aes( xintercept = 0 ))+theme_bw()+theme(text = element_text(size = 20))+xlab("Odds Ratio")+
geom_linerange(aes(xmin = low, xmax= high , y = feature ), color= 'grey40')+ylab("Functional Annotations")
ggsave(plot = p, filename = ${_output:r}, height = 24, width = 20)