Chromosome-Specific Enrichment Analysis of Annotations Using Block Jackknife#

Description#

This notebook performs a chromosome-specific enrichment analysis for genomic annotations using a block jackknife approach. It computes odds ratios (OR) and enrichment statistics for each annotation column by systematically leaving out one chromosome at a time and recalculating the statistics. The analysis provides insight into the overlap and significance of annotations in relation to significant variants within a genomic region.

Key features:

  • Helper function to compute odds ratios and enrichment.calculate_OR_enrichment

  • Chromosome-level leave-one-out cross-validation.

  • Summary statistics, including means and standard errors for OR and enrichment values.

The results are saved as an RDS file for downstream analysis.

Definitions and Test Statistics#

Odds Ratio (OR)#

The Odds Ratio (OR) quantifies the strength of association between significant variants and a specific annotation.

Formula: $\( OR = \frac{\left| AB \right| / \left| A \setminus B \right|}{\left| \text{noA-noB} \right| / \left| B \setminus A \right|} \)$

Where:

  • \(A\): The set of SNPs within the annotation.

  • \(B\): The set of significant SNPs.

  • \(AB\): The intersection of \(A\) and \(B\) (i.e., significant SNPs in the annotation).

  • \(A \setminus B\): SNPs in the annotation but not significant.

  • \(B \setminus A\): Significant SNPs not in the annotation.

  • \(\text{noA-noB}\): SNPs that are neither in the annotation nor significant.


Enrichment#

The Enrichment evaluates whether a genomic annotation contains a higher proportion of significant SNPs than expected by chance.

Formula: $\( \text{Enrichment} = \frac{\text{Proportion of significant SNPs in the annotation}}{\text{Proportion of all SNPs in the annotation}} \)$

Or equivalently: $\( \text{Enrichment} = \frac{\frac{\left| AB \right|}{\left| B \right|}}{\frac{\left| A \right|}{\left| \text{Target Set} \right|}} \)$

Where:

  • \(\left| AB \right|\): Significant SNPs in the annotation.

  • \(\left| B \right|\): Total number of significant SNPs.

  • \(\left| A \right|\): Total number of SNPs in the annotation.

  • \(\left| \text{Target Set} \right|\): Total number of SNPs in the genome or study region.


Standard Error (SE) Computation#

Leave-One-Chromosome-Out (LOCO) Jackknife#

The LOCO method estimates the standard error by removing one chromosome at a time and recomputing the test statistic, capturing variability due to genomic structure.

Steps:

  1. For each chromosome \(i\):

    • Remove chromosome \(i\) from the dataset.

    • Compute \(OR_i\) and \(\text{Enrichment}_i\) using the remaining chromosomes.

  2. Aggregate \(OR_i\) and \(\text{Enrichment}_i\) to compute the mean and SE.

SE Formula#

Using the LOCO estimates (\(\theta_i\) for each chromosome \(i\)): $\( \text{SE}(\theta) = \sqrt{\frac{\sum_{i=1}^{N} (\theta_i - \bar{\theta})^2}{N \cdot (N-1)}} \)$

Where:

  • \(\theta_i\): Statistic (\(OR_i\) or \(\text{Enrichment}_i\)) excluding chromosome \(i\).

  • \(\bar{\theta}\): Mean of \(\theta_i\) across chromosomes.

  • \(N\): Number of chromosomes (e.g., 22 for autosomes).


Computational Workflow#

Step 1: Odds Ratio and Enrichment Computation#

  1. Compute \(OR\) and \(\text{Enrichment}\) using the formulas above.

  2. Repeat for each chromosome using the LOCO approach.

Step 2: Aggregation#

  1. Compute mean \(OR\) and \(\text{Enrichment}\) across chromosomes.

  2. Estimate SE using the jackknife method.

Step 3: Summary Outputs#

Generate summary statistics for each annotation, including:

  • Mean \(OR\), SE of \(OR\).

  • Mean \(\text{Enrichment}\), SE of \(\text{Enrichment}\).

Input Format#

1. significant_variants_path#

  • Format: RDS file containing significant variants.

  • Columns:

    • chr: Chromosome number (integer).

    • pos: Genomic position (integer).

  • Example:

    chr  pos
    1    12345
    1    67890
    

2. baseline_anno_path#

  • Format: RDS file containing a tabular data frame with baseline annotations.

  • Columns:

    • CHR: Chromosome number (integer).

    • BP: Genomic base pair position (integer).

    • SNP: SNP ID (character).

    • CM: Centimorgan position (numeric).

    • base: Base-level information (integer).

    • Annotation columns: Binary columns (0/1) for various genomic annotations (e.g., Coding_UCSC, Conserved_LindbladToh, CTCF_Hoffman, etc.).

    • Flanking regions: Binary columns indicating the presence of annotations within 500 bp of the main region (e.g., Coding_UCSC.flanking.500, Conserved_LindbladToh.flanking.500, etc.).

    • Additional annotation metrics: Annotation statistics like species enhancer counts and promoter scores (Human_Enhancer_Villar_Species_Enhancer_Count, Human_Promoter_Villar_ExAC, etc.).

  • Example:

    CHR   BP    SNP           CM   base   Coding_UCSC   Coding_UCSC.flanking.500      Human_Enhancer_Villar   Human_Enhancer_Villar.flanking.500
    1     11008 rs575272151   0    1      0             0                             0                        0
    1     11012 rs544419019   0    1      0             0                             0                        0
    1     13110 rs540538026   0    1      0             0                             0                        0
    1     13116 rs62635286    0    1      0             0                             0                        0
    

Output Format#

1. enrichment_results.rds#

  • Format: RDS file containing the following components:

    • summary: A data frame summarizing the OR, OR_SE, Enrichment, and Enrichment_SE for each annotation column.

      Annotation                      OR      OR_SE   Enrichment   Enrichment_SE
      Coding_UCSC                    1.23    0.12    0.85         0.10
      Conserved_LindbladToh          0.98    0.08    1.12         0.05
      Human_Enhancer_Villar          1.45    0.15    1.30         0.12
      
    • OR_blockJacknife: A matrix (22 rows for chromosomes × annotation columns) of log2-transformed OR values.

      Coding_UCSC   Conserved_LindbladToh   Human_Enhancer_Villar
      0.12          -0.02                  0.25
      0.15           0.01                  0.18                                   
    • Enrichment_blockJacknife: A matrix (22 rows for chromosomes × annotation columns) of enrichment values.

    • OR: A numeric vector of mean log2-transformed OR values across chromosomes for each annotation column.

    • Enrichment: A numeric vector of mean enrichment values across chromosomes for each annotation column.

    • OR_sd: A numeric vector of standard errors for OR values across chromosomes for each annotation column.

    • Enrichment_sd: A numeric vector of standard errors for enrichment values across chromosomes for each annotation column.

    • annotations: A list of annotation column names.

Minimal Working Example Steps#

sos run xqtl-protocol/pipeline/functional_enrichment.ipynb enrichment\
    --significant_variants_path /path/to/significant_variants.rds \
    --baseline_anno_path ./Anno_Enrichment/Baseline_anno_each.rds \
    --cwd output

Command interface#

sos run functional_enrichment.ipynb -h

Workflow implementation#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')

parameter: significant_variants_path = path
parameter: baseline_anno_path = path
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[enrichment]
parameter: annotations_start = 7
input: significant_variants_path, baseline_anno_path
output: enrichment = f'{cwd:a}/{step_name}/{name}.enrichment_results.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    library(tidyverse)

    # Helper function for enrichment calculation
    calculate_OR_enrichment <- function(set1, set2, target_set = NULL){
        if (is.null(target_set)){
            target_set <- unique(union(set1, set2))
        }
        A <- intersect(set1, target_set)
        B <- intersect(set2, target_set)
        AB <- intersect(A, B)
        AnoB <- setdiff(A, AB)
        noAB <- setdiff(B, AB)
        noAnoB <- setdiff(target_set, c(A,B))
        
        if (length(noAB) == 0 || length(AnoB) == 0) {
            OR <- Enrichment <- 1
        } else {
            OR <- (length(AB) / length(AnoB)) * (length(noAnoB) / length(noAB))
            Den <- length(A) / length(target_set)
            Num <- length(AB) / length(B)
            Enrichment <- Num / Den
        }
        
        return(list("OR" = OR,
                   "Enrichment" = Enrichment))
    }

    # Start timing
    start_time <- Sys.time()
    print(paste("Job started at:", start_time))

    # Load input data
    print("Loading input data...")
    coloc_anno <- readRDS("${significant_variants_path}")
    baseline <- readRDS("${baseline_anno_path}")
    print("Data loaded successfully!")

    # Process significant variants  
    coloc_anno <- sapply(1:nrow(coloc_anno), function(i) {
        a <- coloc_anno[i,]
        if (is.numeric(a$chr) || grepl("^[0-9]+$", a$chr)) { 
            paste0("chr", a$chr, ":", a$pos)
        } else {
            paste0(a$chr, ":", a$pos)  
        }
    })    
    print("Processed significant variants.")

    # Process baseline annotation
    baseline <- baseline %>%
        mutate(chr_bp = paste0("chr", CHR, ":", BP))%>%
        relocate(chr_bp, .before = 1)
    print("Processed baseline annotation.")

    # Get annotation columns
    annotations_start = ${annotations_start}
    annotations <- colnames(baseline)[annotations_start:ncol(baseline)]
    print(paste("Number of annotations:", length(annotations)))

    # Initialize matrices for results
    OR_blockJacknife <- Enrichment_blockJacknife <- matrix(NA, 
        nrow = 22, 
        ncol = length(annotations))
    colnames(OR_blockJacknife) <- colnames(Enrichment_blockJacknife) <- annotations

    # Perform leave-one-chromosome-out analysis
    print("Starting LOCO analysis...")
    for (i.chr in 1:22){
        chr <- i.chr
        pp <- which(baseline$CHR == chr)
        baseline.jk <- baseline[-pp,]
        target_set <- baseline.jk$chr_bp
        
        for (i in 1:length(annotations)){
            anno <- baseline %>% select(annotations[i])
            pos <- which(anno == 1)
            baseline.tmp <- baseline$chr_bp[pos]
            res <- calculate_OR_enrichment(baseline.tmp, coloc_anno, target_set = target_set)
            OR_blockJacknife[i.chr, i] <- res$OR
            Enrichment_blockJacknife[i.chr, i] <- res$Enrichment
        }
        print(paste("Processed chromosome", i.chr, "of 22"))
    }

    # Calculate final statistics
    print("Calculating final statistics...")
    OR <- colMeans(log2(OR_blockJacknife), na.rm = TRUE)
    Enrichment <- colMeans(Enrichment_blockJacknife, na.rm = TRUE)

    OR_sd <- Enrichment_sd <- numeric(length(annotations))
    for (j in 1:length(annotations)){
        OR_sd[j] <- sqrt(var(OR_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
        Enrichment_sd[j] <- sqrt(var(Enrichment_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
    }

    # Create summary data frame
    summary_df <- data.frame(
        Annotation = annotations,
        OR = exp(OR),
        OR_SE = OR_sd,
        Enrichment = Enrichment,
        Enrichment_SE = Enrichment_sd
    )
    print("Summary data frame created.")

    # Prepare results
    results <- list(
        "summary" = summary_df,
        "OR_blockJacknife" = OR_blockJacknife,
        "Enrichment_blockJacknife" = Enrichment_blockJacknife,
        "OR" = OR,
        "Enrichment" = Enrichment,
        "OR_sd" = OR_sd,
        "Enrichment_sd" = Enrichment_sd,
        "annotations" = annotations
    )
    print("Results prepared.")

    # Save results
    saveRDS(results, '${_output['enrichment']}', compress='xz')
    print(paste("Results saved to:", '${_output['enrichment']}'))

    # End timing
    end_time <- Sys.time()
    print(paste("Job ended at:", end_time))
    print(paste("Total time elapsed:", as.numeric(difftime(end_time, start_time, units = "mins")), "minutes"))