GREGOR enrichment analysis#

Description#

Aim. Evaluate the global enrichment of a set of trait-associated (positive) variants within experimentally annotated epigenomic regulatory features, relative to a matched negative set, using GREGOR (Genomic Regulatory Elements and Gwas Overlap algoRithm).

Methods overview#

GREGOR performs enrichment analysis for given positive set of variants against matched negative set, matched by:

  • MAF

  • LD proxy

  • distance to TSS

  • Gene density

Methods to obtain the column “PValue” from GREGOR software can be found in the GREGOR paper: https://pubmed.ncbi.nlm.nih.gov/25886982/. We have noticed some PValue are greater than 1. Also, p-value itself is not as informative compared to odds ratios of enrichment. We therefore parse intermediate outputs from GREGOR to get a 2 by 2 table where the rows are within annotation peak vs outside; the columns are positive set vs matched negative set. We perform Fisher’s exact test to get both p-value and odds ratios.

Input#

  • --gregor_db: a GREGOR reference database.

  • --index_snp_file: an index SNP file of the positive variant set.

  • --bed_file_index: a bed-file index listing the annotation bed files.

Note. gregor_conf (Step 1) only writes a configuration file from the supplied paths, so it runs end-to-end without the external GREGOR software - it is verified here against synthetic toy inputs under input/protocol_example.*. Step 2 (gregor), which invokes the GREGOR perl tool and parses its output, requires the external GREGOR software distributed under the University of Michigan license plus its reference database; that software is not bundled here and is not installed on this system, so Step 2 cannot be executed here. The gregor_fisher_plot workflow (Step 3) is self-contained R and only needs two enrichment-result tables.

Output#

All files are written under --cwd, prefixed by the run name:

  • {name}.gregor.conf — the GREGOR configuration file generated from the supplied paths (Step 1).

  • {name}_gregor_output/StatisticSummaryFile.txt — GREGOR’s raw overlap statistics for the annotation set (Step 2).

  • {name}_variant_counts.txt and {name}_enrichment_results.txt — the parsed 2×2 counts (annotation peak vs outside, positive vs matched-negative variant set) and the per-annotation Fisher exact-test p-values and odds ratios (Step 3).

  • {name}_vs_{ref}_enrichment.pdf — an odds-ratio comparison plot across annotations.

Steps#

Step 1. Build the GREGOR configuration file from the index SNP file and bed-file index (gregor_conf). This step only writes a .gregor.conf file from the supplied paths/parameters and does not invoke the external GREGOR software, so it runs end-to-end with the toy inputs below.

Timing: Requires external GREGOR tool on typical compute infrastructure.

sos run pipeline/gregor.ipynb gregor_conf \
    --gregor_db input/enrichment/protocol_example.gregor_ref \
    --index_snp_file input/ld/protocol_example.index.snps.txt \
    --bed_file_index input/enrichment/protocol_example.bed.file.index \
    --pop EUR \
    --cwd output/gregor

Step 2. Run the full GREGOR enrichment pipeline (config -> run GREGOR -> Fisher test) (gregor). Requires the GREGOR perl software and reference database.

sos run pipeline/gregor.ipynb gregor \
    --gregor_db input/enrichment/protocol_example.gregor_ref \
    --index_snp_file input/ld/protocol_example.index.snps.txt \
    --bed_file_index input/enrichment/protocol_example.bed.file.index \
    --pop EUR \
    --cwd output/gregor

Step 3. Plot an odds-ratio comparison from two enrichment-result tables (gregor_fisher_plot). Self-contained R; runnable from *_enrichment_results.txt tables alone.

sos run pipeline/gregor.ipynb gregor_fisher_plot \
    --fisher1 input/enrichment/protocol_example.trait1_enrichment_results.txt \
    --fisher2 input/enrichment/protocol_example.trait2_enrichment_results.txt \
    --cwd output/gregor

Anticipated Results#

Produces a *_enrichment_results.txt table of per-annotation Fisher p-values and odds ratios, and (via gregor_fisher_plot) an odds-ratio comparison PDF. Annotations enriched in the positive variant set show odds ratios above 1.

Command interface#

sos run pipeline/gregor.ipynb -h

Workflow implementation#

[global]
# working directory
parameter: cwd = path("output")
# Software container option
parameter: container = ""
# make configuration file for GREGOR
[gregor_conf, gregor_1]
parameter: gregor_db = path
parameter: pop = 'EUR'
parameter: index_snp_file = path
parameter: bed_file_index = path
parameter:  r2_threshold = 0.7
parameter: ld_window_size = 10000
parameter: min_neighbor = 10
parameter: job_number = 10
input: index_snp_file, bed_file_index
output: f'{cwd:a}/{_input[0]:bnn}.gregor.conf'
report: output = f'{_output}', expand = True
    ##############################################################################
    # CHIPSEQ ENRICHMENT CONFIGURATION FILE
    # This configuration file contains run-time configuration of
    # CHIP_SEQ ENRICHMENT
    ###############################################################################
    ## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
    ###############################################################################
    INDEX_SNP_FILE = {_input[0]}
    BED_FILE_INDEX = {_input[1]} 
    REF_DIR = {gregor_db}
    R2THRESHOLD = {r2_threshold} ## must be greater than 0.7
    LDWINDOWSIZE = {ld_window_size} ## must be less than this window; these two values define LD buddies
    OUT_DIR = {_output:nn}_gregor_output
    MIN_NEIGHBOR_NUM = {min_neighbor} ## define the minimum size of neighborhood
    BEDFILE_IS_SORTED = true  ## false, if the bed files are not sorted
    POPULATION = {pop}  ## define the population, you can specify EUR, AFR, AMR or ASN
    TOPNBEDFILES = 2 
    JOBNUMBER = {job_number}
    ###############################################################################
    #BATCHTYPE = mosix ##  submit jobs on MOSIX
    #BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
    ###############################################################################
    #BATCHTYPE = slurm   ##  submit jobs on SLURM
    #BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
    ###############################################################################
    BATCHTYPE = local ##  run jobs on local machine

bash: expand = True, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    sed -i '/^$/d' {_output}

GREGOR is written in perl. If you don’t use containers, some libraries are required before one can run GREGOR:

sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl

With our docker image:

cd GREGOR_folder
docker run -v "$PWD:/usr/src/myapp" -it custom-perl
perl script/GREGOR.pl --conf example/mvsusie_annotation.conf
perl script/GREGOR.pl --conf example/susie_annotation.conf
# run GREGOR
[gregor_2]
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True, container = container, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    perl /opt/GREGOR/GREGOR/script/GREGOR.pl --conf {_input} && touch {_output}
# Fisher test of enrichment
[gregor_3]
output: f'{_input:ad}_variant_counts.txt', f'{_input:ad}_enrichment_results.txt'
bash: expand = '$[ ]', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'

    cd $[_input:ad]/  # The issue was that,{_input:ad}_variant_counts.txt is not in the same dir as the input  
    # Loop through each subdirectory
    for dir in */; do
        # Ensure that the directory is not empty
        if [[ -d "$dir" ]]; then
            # Calculate the total number of lines in neighbor.*.txt files
            total_lines=$(find "$dir" -name 'neighbor.*.txt' -exec cat {} + | wc -l)

            # Count the number of neighbor.*.txt files
            file_count=$(find "$dir" -name 'neighbor.*.txt' | wc -l)

            # Calculate the adjusted row number
            row_num=$((total_lines - file_count))

            # Check if PValue.txt exists
            if [[ -f "${dir}/PValue.txt" ]]; then
                # Append the row count to PValue.txt and save as PValue_new.txt
                (cat "${dir}/PValue.txt"; echo "N1_N2 = $row_num") > "${dir}/PValue_N1_N2.txt"
            fi
        fi
    done

    # Header for the new summary file
    echo -e "Bed_File\tInBed_Index_SNP\tExpectNum_of_InBed_SNP\tPValue\tN1_N2\tNp\tNp_Nn" > $[_output[0]]

    # Calculate Np and Np_Nn for the neighbor_SNP directory
    if [[ -d "neighbor_SNP" ]]; then
        if [[ -f "neighbor_SNP/index.snp.neighbors.txt" ]]; then
            Np=$(($(wc -l < "index_SNP/annotated.index.snp.txt") - 1))
        else
            Np=0
        fi
        Np_Nn_files=($(find "neighbor_SNP" -name 'neighbor.chr*txt'))
        Np_Nn=0
        for file in "${Np_Nn_files[@]}"; do
            Np_Nn=$(($Np_Nn + $(wc -l < "$file")))
        done
        Np_Nn=$(($Np_Nn - ${#Np_Nn_files[@]}))
    fi

    # Loop through each subdirectory
    for dir in */; do
        # Ensure that the directory is not empty
        if [[ -d "$dir" ]]; then
            # Check if PValue_new.txt exists
            if [[ -f "${dir}PValue_N1_N2.txt" ]]; then
                # Extract values from PValue_new.txt
                inBedIndexSNPNum=$(grep "inBedIndexSNPNum" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                expectedS=$(grep "expectedS" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                p3=$(grep "p3" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                N1_N2=$(grep "N1_N2" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                # Get the directory name as the Bed_File
                bedFile=$(basename "$dir")
                # Append the data to the summary file
                echo -e "$bedFile\t$inBedIndexSNPNum\t$expectedS\t$p3\t$N1_N2\t$Np\t$Np_Nn" >> $[_output[0]]
            fi
        fi
    done

R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    res <- read.table(${_output[0]:r}, sep ='\t', header=T) 
    for(i in 1:nrow(res)){
        n1 = res$InBed_Index_SNP[i]
        n2 = res$N1_N2[i] - n1
        np = res$Np[i]
        nn = res$Np_Nn[i] - np

        # Construct the contingency matrix
        dat = matrix(c(n1, n2, np - n1, nn - n2), nrow = 2)

        # Perform Fisher's exact test
        test_res = fisher.test(dat, alternative = 'two.sided')

        # Store the results in 'res'
        res$odds[i] <- test_res$estimate
        res$low[i] <- test_res$conf.int[1]
        res$high[i] <- test_res$conf.int[2]
        res$p_fisher[i] <- test_res$p.value
    }

    res$odds <- as.numeric(res$odds)
    write.table(res, ${_output[1]:r}, quote = FALSE, row.names = FALSE)
# enrichment plot with fisher test results
[gregor_fisher_plot]
parameter: fisher1 = path
parameter: fisher2 = path
input: fisher1, fisher2
output: f'{cwd:a}/{_input[0]:dbn}_vs_{_input[1]:dbn}_enrichment.pdf'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    library(tidyverse)
    library(ggplot2)    
    group1 <- read.table(${_input[0]:r}, header=T)
    group2 <- read.table(${_input[1]:r}, header=T)
    group1$group <- ${_input[0]:dbnr}
    group2$group <- ${_input[1]:dbnr}
    res_all <- rbind(group1, group2)
    res_all$feature <- gsub(".bed","",res_all$Bed_File)
    res_all <- res_all %>%filter(!(str_detect(feature,"hg38")))%>%filter(!(p_fisher == 1))%>% na.omit 
    p <- res_all%>%
        arrange(odds)%>%ggplot()+geom_point(aes(x = odds, y = reorder(feature,-odds), color = group))+
            geom_vline(aes( xintercept = 0 ))+theme_bw()+theme(text = element_text(size = 20))+xlab("Odds Ratio")+
                geom_linerange(aes(xmin = low, xmax= high , y = feature ), color= 'grey40')+ylab("Functional Annotations")
    ggsave(plot = p, filename = ${_output:r}, height = 24, width = 20)