GREGOR enrichment analysis

GREGOR enrichment analysis#

GREGOR (Genomic Regulatory Elements and Gwas Overlap algoRithm) is a tool built to evaluate global enrichment of trait-associated variants in experimentally annotated epigenomic regulatory features.

Since GREGOR is released under UMichgan license, we build a Singularity container gregor.sif to run it locally without pushing the image into any container repositories.

FIXME#

Ru: Include the dockerfile script under Methods Overview – DONE
Ru: Add gregor_4 to include a default graphical summary – gregor_fisher_plot, there are 2 traits on the plot, and I think it would be better to seperate from gregor_1/2/3.
Hao: you may want to add base R environment to the container, so we can use container for all steps including gregor_3 gregor_4
Hao: we need to try to get hg38 working – possibly by changing GREGOR reference data. The reference data preparation step is not yet well-documented by Ru. Please revisit and improve the documentation.

Methods overview#

GREGOR performs enrichment analysis for given positive set of variants against matched negative set, matched by:

MAF
LD proxy
distance to TSS
Gene density

Methods to obtain the column “PValue” from GREGOR software can be found in the GREGOR paper: https://pubmed.ncbi.nlm.nih.gov/25886982/. We have noticed some PValue are greater than 1. Also, p-value itself is not as informative compared to odds ratios of enrichment. We therefore parse intermediate outputs from GREGOR to get a 2 by 2 table where the rows are within annotation peak vs outside; the columns are positive set vs matched negative set. We perform Fisher’s exact test to get both p-value and odds ratios.

Reference data preparation#

hg19.fa

Genome Reference: hg19 (GRCh37) or hg38 (GRCh38).
Download link

Annotation files#

Alkes Price’s lab has some annotation files in bed format available for download:

https://data.broadinstitute.org/alkesgroup/LDSCORE/

eg baselineLD_v2.2_bedfiles.tgz.

Annotation files are in bed format (for example: Promoter_UCSC.bed).

    chr1	9873	16361
    chr1	32610	36610
    chr1	67090	71090
    ...

GREGOR reference files#

GREGOR reference files for hg19 can be prepared from downloaded reference data,

cat \
    GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.00 \
    GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz.part.01 \
    > GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz
tar zxvf GREGOR.AFR.ref.r2.greater.than.0.7.tar.gz

MD5SUM check:

AFR.part.0	( MD5: 9926904128dd58d6bf1ad4f1e90638af )
AFR.part.1	( MD5: c1d30aff89a584bfa8c1fa1bdc197f21 )

Same for EUR data-set.

Example command#

sos run pipeline/gregor.ipynb gregor \
    --gregor-db /mnt/vast/hpc/csg/rf2872/data/GREGOR/REF \
    --index-snp-file GREGOR/example/example.index.snps.txt \
    --bed-file-index GREGOR/example/example.bed.file.index  \
    --container gregor.sif 

sos run gregor.ipynb -h

usage: sos run gregor.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  gregor_conf
  gregor

Global Workflow Options:
  --cwd output (as path)
                        working directory
  --container ''
                        Software container option

Sections
  gregor_conf, gregor_1: make configuration file for GREGOR
    Workflow Options:
      --gregor-db VAL (as path, required)
      --pop EUR
      --index-snp-file VAL (as path, required)
      --bed-file-index VAL (as path, required)
      --ld-window-size 10000 (as int)
      --min-neighbor 10 (as int)
      --job-number 10 (as int)
  gregor_2:             run GREGOR
  gregor_3:             Fisher test of enrichment

Explaination for each step#

gregor_conf, gregor_1: make the configuration file for GREGOR
gregor_2: run GREGOR, for the details, please see Ellen et al.
gregor_3: The process for generating a 2x2 table for each annotation to conduct a Fisher test in the context of GREGOR analysis. The structure and computation of the elements in the 2x2 table are as follows:

Annotated Positive Set (N1): Derived from the inBedIndexSNPNum in the PValue.txt file located in each annotation folder. This value is integral to identifying the positive set that has been annotated.
Un-annotated Positive Set (Np): Calculated as the total number of lines in the index_SNP/annotated.index.snp.txt file. It represents the count of input signals that were successfully converted to chromosome positions, forming the overall positive set.
Annotated Negative Set (N2): This is computed by taking the total lines in the neighbor*.txt files (represented as N1 + N2) of each annotation folder and subtracting N1 from it, excluding any header lines. This gives the count of the negative set that has been annotated.
Un-annotated Negative Set (Nn): This is computed by taking the total line in the neighbor_SNP/neighbor*.txt files and subtracting Np from it, with header lines excluded. This represents the un-annotated portion of the negative set.

gregor_fisher_test: The visualization plot is designed to display the results of Fisher’s exact test, comparing two different traits. The input data for this plot is sourced from the ‘gregor_3’ outputs, with each dataset representing a distinct trait.

Example#

sos run ~/codes/xqtl-protocol/pipeline/gregor.ipynb  gregor_fisher_plot \
    --fisher1  mvSuSiE_Basophill_perc_positive_cs.index_gregor_output/*_enrichment_results.txt \
    --fisher2 SuSiE_Basophill_perc_positive_cs.index_gregor_output/*_enrichment_results.txt \
    --cwd test

[global]
# working directory
parameter: cwd = path("output")
# Software container option
parameter: container = ""

# make configuration file for GREGOR
[gregor_conf, gregor_1]
parameter: gregor_db = path
parameter: pop = 'EUR'
parameter: index_snp_file = path
parameter: bed_file_index = path
parameter:  r2_threshold = 0.7
parameter: ld_window_size = 10000
parameter: min_neighbor = 10
parameter: job_number = 10
input: index_snp_file, bed_file_index
output: f'{cwd:a}/{_input[0]:bnn}.gregor.conf'
report: output = f'{_output}', expand = True
    ##############################################################################
    # CHIPSEQ ENRICHMENT CONFIGURATION FILE
    # This configuration file contains run-time configuration of
    # CHIP_SEQ ENRICHMENT
    ###############################################################################
    ## KEY ELEMENTS TO CONFIGURE : NEED TO MODIFY
    ###############################################################################
    INDEX_SNP_FILE = {_input[0]}
    BED_FILE_INDEX = {_input[1]} 
    REF_DIR = {gregor_db}
    R2THRESHOLD = {r2_threshold} ## must be greater than 0.7
    LDWINDOWSIZE = {ld_window_size} ## must be less than this window; these two values define LD buddies
    OUT_DIR = {_output:nn}_gregor_output
    MIN_NEIGHBOR_NUM = {min_neighbor} ## define the minimum size of neighborhood
    BEDFILE_IS_SORTED = true  ## false, if the bed files are not sorted
    POPULATION = {pop}  ## define the population, you can specify EUR, AFR, AMR or ASN
    TOPNBEDFILES = 2 
    JOBNUMBER = {job_number}
    ###############################################################################
    #BATCHTYPE = mosix ##  submit jobs on MOSIX
    #BATCHOPTS = -E/tmp -i -m2000 -j10,11,12,13,14,15,16,17,18,19,120,122,123,124,125 sh -c
    ###############################################################################
    #BATCHTYPE = slurm   ##  submit jobs on SLURM
    #BATCHOPTS = --partition=broadwl --account=pi-mstephens --time=0:30:0
    ###############################################################################
    BATCHTYPE = local ##  run jobs on local machine

bash: expand = True, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    sed -i '/^$/d' {_output}

GREGOR is written in perl. If you don’t use containers, some libraries are required before one can run GREGOR:

sudo apt-get install libdbi-perl libswitch-perl libdbd-sqlite3-perl

With our docker image:

cd GREGOR_folder
docker run -v "$PWD:/usr/src/myapp" -it custom-perl
perl script/GREGOR.pl --conf example/mvsusie_annotation.conf
perl script/GREGOR.pl --conf example/susie_annotation.conf

# run GREGOR
[gregor_2]
output: f'{_input:nn}_gregor_output/StatisticSummaryFile.txt'
bash: expand = True, container = container, stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    perl /opt/GREGOR/GREGOR/script/GREGOR.pl --conf {_input} && touch {_output}

# Fisher test of enrichment
[gregor_3]
output: f'{_input:ad}_variant_counts.txt', f'{_input:ad}_enrichment_results.txt'
bash: expand = '$[ ]', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'

    cd $[_input:ad]/  # The issue was that,{_input:ad}_variant_counts.txt is not in the same dir as the input  
    # Loop through each subdirectory
    for dir in */; do
        # Ensure that the directory is not empty
        if [[ -d "$dir" ]]; then
            # Calculate the total number of lines in neighbor.*.txt files
            total_lines=$(find "$dir" -name 'neighbor.*.txt' -exec cat {} + | wc -l)

            # Count the number of neighbor.*.txt files
            file_count=$(find "$dir" -name 'neighbor.*.txt' | wc -l)

            # Calculate the adjusted row number
            row_num=$((total_lines - file_count))

            # Check if PValue.txt exists
            if [[ -f "${dir}/PValue.txt" ]]; then
                # Append the row count to PValue.txt and save as PValue_new.txt
                (cat "${dir}/PValue.txt"; echo "N1_N2 = $row_num") > "${dir}/PValue_N1_N2.txt"
            fi
        fi
    done

    # Header for the new summary file
    echo -e "Bed_File\tInBed_Index_SNP\tExpectNum_of_InBed_SNP\tPValue\tN1_N2\tNp\tNp_Nn" > $[_output[0]]

    # Calculate Np and Np_Nn for the neighbor_SNP directory
    if [[ -d "neighbor_SNP" ]]; then
        if [[ -f "neighbor_SNP/index.snp.neighbors.txt" ]]; then
            Np=$(($(wc -l < "index_SNP/annotated.index.snp.txt") - 1))
        else
            Np=0
        fi
        Np_Nn_files=($(find "neighbor_SNP" -name 'neighbor.chr*txt'))
        Np_Nn=0
        for file in "${Np_Nn_files[@]}"; do
            Np_Nn=$(($Np_Nn + $(wc -l < "$file")))
        done
        Np_Nn=$(($Np_Nn - ${#Np_Nn_files[@]}))
    fi

    # Loop through each subdirectory
    for dir in */; do
        # Ensure that the directory is not empty
        if [[ -d "$dir" ]]; then
            # Check if PValue_new.txt exists
            if [[ -f "${dir}PValue_N1_N2.txt" ]]; then
                # Extract values from PValue_new.txt
                inBedIndexSNPNum=$(grep "inBedIndexSNPNum" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                expectedS=$(grep "expectedS" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                p3=$(grep "p3" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                N1_N2=$(grep "N1_N2" "${dir}PValue_N1_N2.txt" | cut -d '=' -f2 | tr -d '[:space:]')
                # Get the directory name as the Bed_File
                bedFile=$(basename "$dir")
                # Append the data to the summary file
                echo -e "$bedFile\t$inBedIndexSNPNum\t$expectedS\t$p3\t$N1_N2\t$Np\t$Np_Nn" >> $[_output[0]]
            fi
        fi
    done

R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    res <- read.table(${_output[0]:r}, sep ='\t', header=T) 
    for(i in 1:nrow(res)){
        n1 = res$InBed_Index_SNP[i]
        n2 = res$N1_N2[i] - n1
        np = res$Np[i]
        nn = res$Np_Nn[i] - np

        # Construct the contingency matrix
        dat = matrix(c(n1, n2, np - n1, nn - n2), nrow = 2)

        # Perform Fisher's exact test
        test_res = fisher.test(dat, alternative = 'two.sided')

        # Store the results in 'res'
        res$odds[i] <- test_res$estimate
        res$low[i] <- test_res$conf.int[1]
        res$high[i] <- test_res$conf.int[2]
        res$p_fisher[i] <- test_res$p.value
    }

    res$odds <- as.numeric(res$odds)
    write.table(res, ${_output[1]:r}, quote = FALSE, row.names = FALSE)

# enrichment plot with fisher test results
[gregor_fisher_plot]
parameter: fisher1 = path
parameter: fisher2 = path
input: fisher1, fisher2
output: f'{cwd:a}/{_input[0]:dbn}_vs_{_input[1]:dbn}_enrichment.pdf'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    library(tidyverse)
    library(ggplot2)    
    group1 <- read.table(${_input[0]:r}, header=T)
    group2 <- read.table(${_input[1]:r}, header=T)
    group1$group <- ${_input[0]:dbnr}
    group2$group <- ${_input[1]:dbnr}
    res_all <- rbind(group1, group2)
    res_all$feature <- gsub(".bed","",res_all$Bed_File)
    res_all <- res_all %>%filter(!(str_detect(feature,"hg38")))%>%filter(!(p_fisher == 1))%>% na.omit 
    p <- res_all%>%
        arrange(odds)%>%ggplot()+geom_point(aes(x = odds, y = reorder(feature,-odds), color = group))+
            geom_vline(aes( xintercept = 0 ))+theme_bw()+theme(text = element_text(size = 20))+xlab("Odds Ratio")+
                geom_linerange(aes(xmin = low, xmax= high , y = feature ), color= 'grey40')+ylab("Functional Annotations")
    ggsave(plot = p, filename = ${_output:r}, height = 24, width = 20)