GREGOR enrichment analysis#
GREGOR (Genomic Regulatory Elements and Gwas Overlap algoRithm) is a tool built to evaluate global enrichment of trait-associated variants in experimentally annotated epigenomic regulatory features.
Since GREGOR is released under UMichgan license, we build a Singularity container gregor.sif
to run it locally without pushing the image into any container repositories.
FIXME#
Ru: Include the dockerfile script under Methods Overview – DONE
Ru: Add
gregor_4
to include a default graphical summary –gregor_fisher_plot
, there are 2 traits on the plot, and I think it would be better to seperate from gregor_1/2/3.Hao: you may want to add base R environment to the container, so we can use container for all steps including
gregor_3
gregor_4
Hao: we need to try to get hg38 working – possibly by changing GREGOR reference data. The reference data preparation step is not yet well-documented by Ru. Please revisit and improve the documentation.
Methods overview#
GREGOR performs enrichment analysis for given positive set of variants against matched negative set, matched by:
MAF
LD proxy
distance to TSS
Gene density
Methods to obtain the column “PValue” from GREGOR software can be found in the GREGOR paper: https://pubmed.ncbi.nlm.nih.gov/25886982/. We have noticed some PValue are greater than 1. Also, p-value itself is not as informative compared to odds ratios of enrichment. We therefore parse intermediate outputs from GREGOR to get a 2 by 2 table where the rows are within annotation peak vs outside; the columns are positive set vs matched negative set. We perform Fisher’s exact test to get both p-value and odds ratios.
Reference data preparation#
hg19.fa
Genome Reference: hg19 (GRCh37) or hg38 (GRCh38).
Download link
Annotation files#
Alkes Price’s lab has some annotation files in bed
format available for download:
https://data.broadinstitute.org/alkesgroup/LDSCORE/
eg baselineLD_v2.2_bedfiles.tgz
.
Annotation files are in bed
format (for example: Promoter_UCSC.bed).
chr1 9873 16361
chr1 32610 36610
chr1 67090 71090
...
GREGOR reference files#
GREGOR reference files for hg19 can be prepared from downloaded reference data,
cat \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
> GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
MD5SUM check:
AFR.part.0 ( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1 ( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )
Same for EUR data-set.
Example command#
sos run pipeline/gregor.ipynb gregor \
--gregor-db /mnt/vast/hpc/csg/rf2872/data/GREGOR/REF \
--index-snp-file GREGOR/example/example.index.snps.txt \
--bed-file-index GREGOR/example/example.bed.file.index \
--container gregor.sif
sos run gregor.ipynb -h
usage: sos run gregor.ipynb [workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
gregor_conf
gregor
Global Workflow Options:
--cwd output (as path)
working directory
--container ''
Software container option
Sections
gregor_conf, gregor_1: make configuration file for GREGOR
Workflow Options:
--gregor-db VAL (as path, required)
--pop EUR
--index-snp-file VAL (as path, required)
--bed-file-index VAL (as path, required)
--ld-window-size 10000 (as int)
--min-neighbor 10 (as int)
--job-number 10 (as int)
gregor_2: run GREGOR
gregor_3: Fisher test of enrichment
Explaination for each step#
gregor_conf, gregor_1
: make the configuration file for GREGORgregor_2
: run GREGOR, for the details, please see Ellen et al.gregor_3
: The process for generating a 2x2 table for each annotation to conduct a Fisher test in the context of GREGOR analysis. The structure and computation of the elements in the 2x2 table are as follows:
Annotated Positive Set (N1): Derived from the
inBedIndexSNPNum
in the PValue.txt file located in each annotation folder. This value is integral to identifying the positive set that has been annotated.Un-annotated Positive Set (Np): Calculated as the total number of lines in the
index_SNP/annotated.index.snp.txt
file. It represents the count of input signals that were successfully converted to chromosome positions, forming the overall positive set.Annotated Negative Set (N2): This is computed by taking the total lines in the
neighbor*.txt
files (represented as N1 + N2) of each annotation folder and subtracting N1 from it, excluding any header lines. This gives the count of the negative set that has been annotated.Un-annotated Negative Set (Nn): This is computed by taking the total line in the
neighbor_SNP/neighbor*.txt
files and subtracting Np from it, with header lines excluded. This represents the un-annotated portion of the negative set.
gregor_fisher_test
: The visualization plot is designed to display the results of Fisher’s exact test, comparing two different traits. The input data for this plot is sourced from the ‘gregor_3’ outputs, with each dataset representing a distinct trait.
Example#
sos run ~/codes/xqtl-protocol/pipeline/gregor.ipynb gregor_fisher_plot \
--fisher1 mvSuSiE_Basophill_perc_positive_cs.index_gregor_output/*_enrichment_results.txt \
--fisher2 SuSiE_Basophill_perc_positive_cs.index_gregor_output/*_enrichment_results.txt \
--cwd test
[global]
# working directory
parameter: cwd = path("output")
# Software container option
parameter: container = ""
# make configuration file for GREGOR
[gregor_conf, gregor_1]
parameter: gregor_db = path
parameter: pop = 'EUR'
parameter: index_snp_file = path
parameter: bed_file_index = path
parameter: r2_threshold = 0.7
parameter: ld_window_size = 10000
parameter: min_neighbor = 10
parameter: job_number = 10
input: index_snp_file, bed_file_index
output: f'{cwd:a}/{_input[0]:bnn}.gregor.conf'
report: output = f'{_output}', expand = True
##############################################################################
# CHIPSEQ ENRICHMENT CONFIGURATION FILE
# This configuration file contains run-time configuration of
# CHIP_SEQ ENRICHMENT
###############################################################################
## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
###############################################################################
INDEX_SNP_FILE = {_input[0]}
BED_FILE_INDEX = {_input[1]}
REF_DIR = {gregor_db}
R2THRESHOLD = {r2_threshold} ## must be greater than 0.7
LDWINDOWSIZE = {ld_window_size} ## must be less than this window; these two values define LD buddies
OUT_DIR = {_output:nn}_gregor_output
MIN_NEIGHBOR_NUM = {min_neighbor} ## define the minimum size of neighborhood
BEDFILE_IS_SORTED = true ## false, if the bed files are not sorted
POPULATION = {pop} ## define the population, you can specify EUR, AFR, AMR or ASN
TOPNBEDFILES = 2
JOBNUMBER = {job_number}
###############################################################################
#BATCHTYPE = mosix ## submit jobs on MOSIX
#BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
###############################################################################
#BATCHTYPE = slurm ## submit jobs on SLURM
#BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
###############################################################################
BATCHTYPE = local ## run jobs on local machine
bash: expand = True, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
sed -i '/^$/d' {_output}
GREGOR is written in perl
. If you don’t use containers, some libraries are required before one can run GREGOR:
sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl
With our docker image:
cd GREGOR_folder
docker run -v "$PWD:/usr/src/myapp" -it custom-perl
perl script/GREGOR.pl --conf example/mvsusie_annotation.conf
perl script/GREGOR.pl --conf example/susie_annotation.conf
# run GREGOR
[gregor_2]
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True, container = container, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
perl /opt/GREGOR/GREGOR/script/GREGOR.pl --conf {_input} && touch {_output}
# Fisher test of enrichment
[gregor_3]
output: f'{_input:ad}_variant_counts.txt', f'{_input:ad}_enrichment_results.txt'
bash: expand = '$[ ]', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
cd $[_input:ad]/ # The issue was that,{_input:ad}_variant_counts.txt is not in the same dir as the input
# Loop through each subdirectory
for dir in */; do
# Ensure that the directory is not empty
if [[ -d "$dir" ]]; then
# Calculate the total number of lines in neighbor.*.txt files
total_lines=$(find "$dir" -name 'neighbor.*.txt' -exec cat {} + | wc -l)
# Count the number of neighbor.*.txt files
file_count=$(find "$dir" -name 'neighbor.*.txt' | wc -l)
# Calculate the adjusted row number
row_num=$((total_lines - file_count))
# Check if PValue.txt exists
if [[ -f "${dir}/PValue.txt" ]]; then
# Append the row count to PValue.txt and save as PValue_new.txt
(cat "${dir}/PValue.txt"; echo "N1_N2 = $row_num") > "${dir}/PValue_N1_N2.txt"
fi
fi
done
# Header for the new summary file
echo -e "Bed_File\tInBed_Index_SNP\tExpectNum_of_InBed_SNP\tPValue\tN1_N2\tNp\tNp_Nn" > $[_output[0]]
# Calculate Np and Np_Nn for the neighbor_SNP directory
if [[ -d "neighbor_SNP" ]]; then
if [[ -f "neighbor_SNP/index.snp.neighbors.txt" ]]; then
Np=$(($(wc -l < "index_SNP/annotated.index.snp.txt") - 1))
else
Np=0
fi
Np_Nn_files=($(find "neighbor_SNP" -name 'neighbor.chr*txt'))
Np_Nn=0
for file in "${Np_Nn_files[@]}"; do
Np_Nn=$(($Np_Nn + $(wc -l < "$file")))
done
Np_Nn=$(($Np_Nn - ${#Np_Nn_files[@]}))
fi
# Loop through each subdirectory
for dir in */; do
# Ensure that the directory is not empty
if [[ -d "$dir" ]]; then
# Check if PValue_new.txt exists
if [[ -f "${dir}PValue_N1_N2.txt" ]]; then
# Extract values from PValue_new.txt
inBedIndexSNPNum=$(grep "inBedIndexSNPNum" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
expectedS=$(grep "expectedS" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
p3=$(grep "p3" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
N1_N2=$(grep "N1_N2" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
# Get the directory name as the Bed_File
bedFile=$(basename "$dir")
# Append the data to the summary file
echo -e "$bedFile\t$inBedIndexSNPNum\t$expectedS\t$p3\t$N1_N2\t$Np\t$Np_Nn" >> $[_output[0]]
fi
fi
done
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
res <- read.table(${_output[0]:r}, sep ='\t', header=T)
for(i in 1:nrow(res)){
n1 = res$InBed_Index_SNP[i]
n2 = res$N1_N2[i] - n1
np = res$Np[i]
nn = res$Np_Nn[i] - np
# Construct the contingency matrix
dat = matrix(c(n1, n2, np - n1, nn - n2), nrow = 2)
# Perform Fisher's exact test
test_res = fisher.test(dat, alternative = 'two.sided')
# Store the results in 'res'
res$odds[i] <- test_res$estimate
res$low[i] <- test_res$conf.int[1]
res$high[i] <- test_res$conf.int[2]
res$p_fisher[i] <- test_res$p.value
}
res$odds <- as.numeric(res$odds)
write.table(res, ${_output[1]:r}, quote = FALSE, row.names = FALSE)
# enrichment plot with fisher test results
[gregor_fisher_plot]
parameter: fisher1 = path
parameter: fisher2 = path
input: fisher1, fisher2
output: f'{cwd:a}/{_input[0]:dbn}_vs_{_input[1]:dbn}_enrichment.pdf'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
library(tidyverse)
library(ggplot2)
group1 <- read.table(${_input[0]:r}, header=T)
group2 <- read.table(${_input[1]:r}, header=T)
group1$group <- ${_input[0]:dbnr}
group2$group <- ${_input[1]:dbnr}
res_all <- rbind(group1, group2)
res_all$feature <- gsub(".bed","",res_all$Bed_File)
res_all <- res_all %>%filter(!(str_detect(feature,"hg38")))%>%filter(!(p_fisher == 1))%>% na.omit
p <- res_all%>%
arrange(odds)%>%ggplot()+geom_point(aes(x = odds, y = reorder(feature,-odds), color = group))+
geom_vline(aes( xintercept = 0 ))+theme_bw()+theme(text = element_text(size = 20))+xlab("Odds Ratio")+
geom_linerange(aes(xmin = low, xmax= high , y = feature ), color= 'grey40')+ylab("Functional Annotations")
ggsave(plot = p, filename = ${_output:r}, height = 24, width = 20)