Chromosome-Specific Enrichment Analysis of Annotations Using Block Jackknife

Description#

This notebook implements a chromosome-specific enrichment analysis for genomic annotations using a block jackknife. For each annotation column, it computes the odds ratio (OR) and the enrichment of significant variants, then leaves out one chromosome at a time and recomputes both statistics to estimate their standard errors. The results show how strongly each annotation overlaps the set of significant variants and how stable that overlap is across chromosomes.

Definitions and Test Statistics#

Odds Ratio (OR)#

The Odds Ratio (OR) quantifies the strength of association between significant variants and a specific annotation.

Formula: $$ OR = \frac{\left| AB \right| / \left| A \setminus B \right|}{\left| B \setminus A \right| / \left| \text{noA-noB} \right|} $$

Where:

$A$: The set of SNPs within the annotation.
$B$: The set of significant SNPs.
$AB$: The intersection of $A$ and $B$ (i.e., significant SNPs in the annotation).
$A \setminus B$: SNPs in the annotation but not significant.
$B \setminus A$: Significant SNPs not in the annotation.
$\text{noA-noB}$: SNPs that are neither in the annotation nor significant.
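
As a minimal sketch of these definitions, the R snippet below builds the four cells of the 2×2 table from two toy sets of variant IDs and computes the OR as the cross-product ratio, which is the form used by calculate_OR_enrichment in the workflow implementation. The variant IDs and the 10-variant universe are made up for illustration.

```r
# Toy universe of 10 variants (hypothetical IDs, not real data)
target_set <- paste0("chr1:", 1:10)
A <- paste0("chr1:", 1:4)          # variants in the annotation
B <- paste0("chr1:", c(1, 2, 5))   # significant variants

AB     <- intersect(A, B)                   # in the annotation and significant
AnoB   <- setdiff(A, B)                     # in the annotation, not significant
noAB   <- setdiff(B, A)                     # significant, not in the annotation
noAnoB <- setdiff(target_set, union(A, B))  # neither

# Odds of being annotated among significant variants relative to the rest
OR <- (length(AB) / length(AnoB)) / (length(noAB) / length(noAnoB))
print(OR)
```

With these toy sets the four cell counts are 2, 2, 1 and 5, so the OR is (2/2) / (1/5) = 5.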

Enrichment#

The Enrichment evaluates whether a genomic annotation contains a higher proportion of significant SNPs than expected by chance.

Formula: $$ \text{Enrichment} = \frac{\text{Proportion of significant SNPs in the annotation}}{\text{Proportion of all SNPs in the annotation}} $$

Or equivalently: $$ \text{Enrichment} = \frac{\frac{\left| AB \right|}{\left| B \right|}}{\frac{\left| A \right|}{\left| \text{Target Set} \right|}} $$

Where:

$\left| AB \right|$: Significant SNPs in the annotation.
$\left| B \right|$: Total number of significant SNPs.
$\left| A \right|$: Total number of SNPs in the annotation.
$\left| \text{Target Set} \right|$: Total number of SNPs in the genome or study region.
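
The enrichment ratio can be sketched the same way. The snippet below uses made-up variant IDs and a hypothetical 10-variant target set; it is an illustration of the formula, not the workflow's own code.

```r
# Toy sets (hypothetical IDs, not real data)
target_set <- paste0("chr1:", 1:10)
A <- paste0("chr1:", 1:4)          # variants in the annotation
B <- paste0("chr1:", c(1, 2, 5))   # significant variants
AB <- intersect(A, B)

prop_sig_in_anno <- length(AB) / length(B)          # share of significant SNPs that are annotated
prop_all_in_anno <- length(A) / length(target_set)  # share of all SNPs that are annotated
Enrichment <- prop_sig_in_anno / prop_all_in_anno
print(Enrichment)
```

Here two of the three significant variants fall in an annotation that covers 40% of the target set, giving an enrichment of (2/3) / 0.4 ≈ 1.67.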

Standard Error (SE) Computation#

Leave-One-Chromosome-Out (LOCO) Jackknife#

The LOCO method estimates the standard error by removing one chromosome at a time and recomputing the test statistic, capturing variability due to genomic structure.

Steps:

For each chromosome $i$:
- Remove chromosome $i$ from the dataset.
- Compute $OR_i$ and $\text{Enrichment}_i$ using the remaining chromosomes.
Aggregate $OR_i$ and $\text{Enrichment}_i$ to compute the mean and SE.
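
The steps above can be sketched on a toy data set with three chromosomes. The data frame, the column name anno, and the helper enrichment() are hypothetical and only mirror the logic described here; the actual workflow loops over chromosomes 1 to 22 and over every annotation column.

```r
# Toy baseline: 3 chromosomes x 4 variants, one binary annotation (made-up values)
baseline <- data.frame(
  CHR  = rep(1:3, each = 4),
  BP   = rep(1:4, times = 3),
  anno = c(1, 1, 0, 0,  1, 0, 0, 0,  1, 1, 1, 0)
)
baseline$id <- paste0("chr", baseline$CHR, ":", baseline$BP)
significant <- c("chr1:1", "chr1:3", "chr2:1", "chr3:2", "chr3:4")

# Enrichment of B in A, with both sets restricted to the target set
enrichment <- function(A, B, target) {
  A <- intersect(A, target)
  B <- intersect(B, target)
  (length(intersect(A, B)) / length(B)) / (length(A) / length(target))
}

# Leave each chromosome out in turn and recompute the statistic
chroms <- sort(unique(baseline$CHR))
theta <- sapply(chroms, function(chr) {
  keep <- baseline$CHR != chr
  enrichment(baseline$id[baseline$anno == 1], significant, baseline$id[keep])
})
names(theta) <- paste0("without_chr", chroms)
print(theta)
```

Each element of theta is one leave-one-out estimate; the mean and jackknife SE are then computed across these values.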

SE Formula#

Using the LOCO estimates ($\theta_i$ for each chromosome $i$): $$ \text{SE}(\theta) = \sqrt{\frac{N-1}{N} \sum_{i=1}^{N} (\theta_i - \bar{\theta})^2} $$

Where:

$\theta_i$: Statistic ($OR_i$ or $\text{Enrichment}_i$) excluding chromosome $i$.
$\bar{\theta}$: Mean of $\theta_i$ across chromosomes.
$N$: Number of chromosomes (e.g., 22 for autosomes).
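
The workflow implementation computes the SE as sqrt(var(theta) * (N - 1)^2 / N) with N = 22, which is the delete-one jackknife standard error sqrt((N - 1) / N * sum((theta_i - mean(theta))^2)), because var() already divides by N - 1. The sketch below checks that equivalence on four hypothetical leave-one-out estimates; the function name jackknife_se is an assumption for illustration.

```r
# Delete-one jackknife standard error of a vector of leave-one-out estimates
jackknife_se <- function(theta) {
  theta <- theta[!is.na(theta)]
  n <- length(theta)
  sqrt((n - 1) / n * sum((theta - mean(theta))^2))
}

theta <- c(1.2, 0.9, 1.5, 1.1)   # hypothetical leave-one-out estimates
se <- jackknife_se(theta)

# Same quantity in the form used by the workflow code
n <- length(theta)
se_alt <- sqrt(var(theta) * (n - 1)^2 / n)

print(c(se = se, se_alt = se_alt))
```

For these four values the sum of squared deviations is 0.1875, so both expressions give sqrt(0.75 * 0.1875) = 0.375.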

Computational Workflow#

Step 1: Odds Ratio and Enrichment Computation#

Compute $OR$ and $\text{Enrichment}$ using the formulas above.
Repeat for each chromosome using the LOCO approach.

Step 2: Aggregation#

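Compute the mean of $\log_2 OR_i$ (back-transformed to the OR scale for the summary) and the mean $\text{Enrichment}_i$ across the leave-one-chromosome-out runs.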
Estimate SE using the jackknife method.

Step 3: Summary Outputs#

Generate summary statistics for each annotation, including:

Mean $OR$, SE of $OR$.
Mean $\text{Enrichment}$, SE of $\text{Enrichment}$.
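
The workflow implementation additionally reports a Z-score and P-value for enrichment, obtained by dividing the mean by its jackknife SE and referring the squared Z-score to a chi-square distribution with one degree of freedom. A minimal sketch on the log2 scale, where no enrichment corresponds to 0, is shown below; the four leave-one-out values are hypothetical.

```r
enrichment_loo <- c(1.30, 1.45, 1.20, 1.38)   # hypothetical leave-one-out enrichments
log2_loo <- log2(enrichment_loo)

n   <- length(log2_loo)
est <- mean(log2_loo)                                   # mean log2 enrichment
se  <- sqrt((n - 1) / n * sum((log2_loo - est)^2))      # jackknife SE

z <- est / se
p <- pchisq(z^2, df = 1, lower.tail = FALSE)            # two-sided P-value

print(data.frame(Enrichment_log2 = est, SE_log2 = se, Z = z, P = p))
```

The chi-square test on z^2 is equivalent to a two-sided normal test on z.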

Input#

significant_variants_path

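Format: RDS file containing significant variants. The reader in the workflow implementation also accepts delimited text files with a .txt or .tsv extension, optionally gzipped, as in the minimal working example. This must contain some variants that are not in the baseline_anno_path input.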
Columns:
- chr: Chromosome number (integer, required).
- pos: Genomic position (integer, required).
Example:
```
chr  pos
1    12345
1    67890
```
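
A minimal sketch of preparing such a file in R is shown below, using the two example rows above. The file name and the use of a temporary directory are assumptions for illustration.

```r
# Build the example table with integer chr and pos columns
sig <- data.frame(chr = c(1L, 1L), pos = c(12345L, 67890L))

# Hypothetical output location; replace with your own path
sig_path <- file.path(tempdir(), "significant_variants.rds")
saveRDS(sig, sig_path)

# Read it back to confirm the structure the workflow expects
sig_check <- readRDS(sig_path)
str(sig_check)
```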

baseline_anno_path

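Format: RDS file containing a tabular data frame with baseline annotations. The reader in the workflow implementation also accepts delimited text files with a .txt or .tsv extension, optionally gzipped. This must contain some variants that are not in the significant_variants_path input.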
Columns:
- CHR: Chromosome number (integer, required).
- BP: Genomic base pair position (integer, required).
- SNP: SNP ID (character, optional).
- CM: Centimorgan position (numeric, optional).
- base: Base-level information (integer, optional).
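- Annotation columns: Binary columns (0/1, required) for various genomic annotations (e.g., Coding_UCSC, Conserved_LindbladToh, CTCF_Hoffman, etc.). Multiple such annotation columns may exist in the input file. The index of the first annotation column is given by the --annotations-start argument (default 7). The workflow counts columns after prepending a chr_bp identifier column, so with the five leading columns listed above the first annotation is column 7.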

Example:

CHR   BP    SNP           CM   base   Coding_UCSC   Coding_UCSC.flanking.500   ⋯   Human_Enhancer_Villar   Human_Enhancer_Villar.flanking.500
   11008 rs575272151   0    1      0             0                          ⋯   0                        0
   11012 rs544419019   0    1      0             0                          ⋯   0                        0
   13110 rs540538026   0    1      0             0                          ⋯   0                        0
   13116 rs62635286    0    1      0             0                          ⋯   0                        0
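
Because the workflow prepends a chr_bp column before selecting annotations, the value for --annotations-start is the position of the first annotation column in the file plus one. The sketch below works this out on a toy table; the SNP IDs and annotation values are placeholders, not real data.

```r
# Toy baseline table with the five leading columns and two annotation columns
baseline <- data.frame(
  CHR = c(1L, 1L), BP = c(100L, 200L), SNP = c("rs_a", "rs_b"),
  CM = c(0, 0), base = c(1L, 1L),
  Coding_UCSC = c(0L, 1L), Conserved_LindbladToh = c(1L, 0L)
)

# Position of the first annotation column, shifted by one for the chr_bp column
annotations_start <- match("Coding_UCSC", colnames(baseline)) + 1

# Mimic what the workflow does: prepend chr_bp, then take columns from the start index
with_id <- cbind(chr_bp = paste0("chr", baseline$CHR, ":", baseline$BP), baseline)
annotations <- colnames(with_id)[annotations_start:ncol(with_id)]

print(annotations_start)
print(annotations)
```

For this layout the computed start index is 7, which matches the default of --annotations-start.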

Output#

enrichment_results.rds

Format: RDS file containing the following components:
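- summary: A data frame summarizing the OR, OR_SE, Enrichment, and Enrichment_SE for each annotation column. The implementation below also adds log2-scale estimates and SEs, plus Z-scores and P-values for enrichment.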
```
Annotation                      OR      OR_SE   Enrichment   Enrichment_SE
Coding_UCSC                    1.23    0.12    0.85         0.10
Conserved_LindbladToh          0.98    0.08    1.12         0.05
Human_Enhancer_Villar          1.45    0.15    1.30         0.12
```
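- OR_blockJacknife: A matrix (22 rows, one per left-out chromosome × annotation columns) of leave-one-chromosome-out OR values. The implementation below stores these on the raw scale and applies log2 only when averaging, so log2 values such as those illustrated here correspond to log2(OR_blockJacknife).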
```
Coding_UCSC   Conserved_LindbladToh   Human_Enhancer_Villar
0.12          -0.02                  0.25
0.15           0.01                  0.18
⋮              ⋮                     ⋮
```
- Enrichment_blockJacknife: A matrix (22 rows for chromosomes × annotation columns) of enrichment values.
- OR: A numeric vector of mean log2-transformed OR values across chromosomes for each annotation column.
- Enrichment: A numeric vector of mean enrichment values across chromosomes for each annotation column.
- OR_sd: A numeric vector of standard errors for OR values across chromosomes for each annotation column.
- Enrichment_sd: A numeric vector of standard errors for enrichment values across chromosomes for each annotation column.
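- Enrichment_Z_scores: A numeric vector of enrichment Z-scores (mean enrichment divided by its standard error) for each annotation column.
- Enrichment_P_values: A numeric vector of P-values from a chi-square test with one degree of freedom on the squared Z-scores.
- annotations: A character vector of annotation column names.

The summary data frame is also written as a gzipped TSV file next to the RDS output, with the .rds suffix replaced by _summary.tsv.gz.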
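
As a rough sketch of how the output might be inspected, the snippet below ranks annotations by enrichment. To stay self-contained it uses a mock object filled with the illustrative values from the summary table in this section, not real results; in practice the object would come from readRDS() on the workflow output.

```r
# Mock `results` with the same shape as the summary component (illustrative values only)
results <- list(summary = data.frame(
  Annotation    = c("Coding_UCSC", "Conserved_LindbladToh", "Human_Enhancer_Villar"),
  OR            = c(1.23, 0.98, 1.45),
  OR_SE         = c(0.12, 0.08, 0.15),
  Enrichment    = c(0.85, 1.12, 1.30),
  Enrichment_SE = c(0.10, 0.05, 0.12)
))
# In practice: results <- readRDS("<path to the .rds file written by the workflow>")

# Rank annotations from most to least enriched
s <- results$summary
top <- s[order(-s$Enrichment), c("Annotation", "Enrichment", "Enrichment_SE")]
print(top)
```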

Minimal Working Example Steps#

sos run pipeline/eoo_enrichment.ipynb enrichment \
    --significant_variants_path data/eoo_enrichment/colocboost_binary_vcp0.1_hg38_annotation.tsv.gz \
    --baseline_anno_path data/eoo_enrichment/colocboost_binary_vcp0.1_hg38_annotation_data.tsv \
    --name enrichment_results \
    --cwd output/eoo_enrichment

INFO: Running enrichment: 
INFO: enrichment is completed.
INFO: enrichment output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/eoo_enrichment/enrichment/enrichment_results.enrichment_results.rds
INFO: Workflow enrichment (ID=wb945681d54e9f1a9) is executed successfully with 1 completed step.

Command interface#

sos run eoo_enrichment.ipynb -h

usage: sos run eoo_enrichment.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters
Workflows:
  enrichment
Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --significant-variants-path VAL (as path, required)
  --baseline-anno-path VAL (as path, required)
  --numThreads 8 (as int)
                        Number of threads
  --name eoo
                        For cluster jobs, number commands to run per job
  --job-size 1 (as int)
  --walltime 12h
  --mem 16G
Sections
  enrichment:
    Workflow Options:
      --annotations-start 7 (as int)

Workflow implementation#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')

parameter: significant_variants_path = path
parameter: baseline_anno_path = path
# Number of threads
parameter: numThreads = 8
# Prefix for the output file name
parameter: name = 'eoo'
# For cluster jobs, number of commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'

[enrichment]
parameter: annotations_start = 7
output: enrichment = f'{cwd:a}/{step_name}/{name}.enrichment_results.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    library(tidyverse)

    # Helper function to read RDS or delimited text (.txt/.tsv, optionally gzipped)
    read_input_file <- function(file_path) {
        # Extension after stripping a trailing ".gz" (e.g., "tsv" from "x.tsv.gz")
        base_ext <- tolower(tools::file_ext(sub("\\.gz$", "", file_path, ignore.case = TRUE)))

        if (base_ext == "rds") {
            return(readRDS(file_path))
        } else if (base_ext %in% c("txt", "tsv")) {
            return(data.table::fread(file_path))
        } else {
            stop(paste("Unsupported file format:", basename(file_path)))
        }
    }

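    # Odds ratio and enrichment of set2 (significant variants) within
    # set1 (annotated variants), both restricted to target_set
    calculate_OR_enrichment <- function(set1, set2, target_set = NULL){
        if (is.null(target_set)){
            target_set <- union(set1, set2)
        }
        A <- intersect(set1, target_set)        # annotated variants
        B <- intersect(set2, target_set)        # significant variants
        AB <- intersect(A, B)                   # annotated and significant
        AnoB <- setdiff(A, AB)                  # annotated, not significant
        noAB <- setdiff(B, AB)                  # significant, not annotated
        noAnoB <- setdiff(target_set, c(A,B))   # neither

        # Fall back to the null value 1 when a denominator would be zero
        if (length(noAB) == 0 || length(AnoB) == 0) {
            OR <- Enrichment <- 1
        } else {
            OR <- (length(AB) / length(AnoB)) * (length(noAnoB) / length(noAB))
            Den <- length(A) / length(target_set)
            Num <- length(AB) / length(B)
            Enrichment <- Num / Den
        }

        return(list("OR" = OR,
                   "Enrichment" = Enrichment))
    }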

    # Start timing
    start_time <- Sys.time()
    print(paste("Job started at:", start_time))

    # Load input data
    print("Loading input data...")
    your_anno <- read_input_file("${significant_variants_path}")
    baseline <- read_input_file("${baseline_anno_path}")
    print("Data loaded successfully!")

    if ("chr" %in% colnames(baseline) && !"CHR" %in% colnames(baseline)) {
        names(baseline)[names(baseline) == "chr"] <- "CHR"
    }

    if ("pos" %in% colnames(baseline) && !"BP" %in% colnames(baseline)) {
        names(baseline)[names(baseline) == "pos"] <- "BP"
    }

    if (!is.numeric(baseline$CHR)) {
        baseline$CHR <- as.numeric(gsub("chr", "", baseline$CHR))
    }


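    # Process significant variants into "chr<N>:<pos>" IDs (vectorized, so an
    # empty table no longer breaks the 1:nrow() loop); the "chr" prefix is
    # added only when the chromosome value is purely numeric
    sig_chr <- as.character(your_anno$chr)
    your_anno <- ifelse(grepl("^[0-9]+$", sig_chr),
                        paste0("chr", sig_chr, ":", your_anno$pos),
                        paste0(sig_chr, ":", your_anno$pos))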
    print("Processed significant variants.")

    # Process baseline annotation
    baseline <- baseline %>%
        mutate(chr_bp = paste0("chr", CHR, ":", BP))%>%
        relocate(chr_bp, .before = 1)

    print("Processed baseline annotation.")

    # Get annotation columns
    annotations_start = ${annotations_start}
    annotations <- colnames(baseline)[annotations_start:ncol(baseline)]
    print(paste("Number of annotations:", length(annotations)))

    # Initialize matrices for results
    OR_blockJacknife <- Enrichment_blockJacknife <- matrix(NA, 
        nrow = 22, 
        ncol = length(annotations))
    colnames(OR_blockJacknife) <- colnames(Enrichment_blockJacknife) <- annotations

    # Perform leave-one-chromosome-out analysis
    print("Starting LOCO analysis...")
    for (i.chr in 1:22){
        # Drop chromosome i.chr. Logical indexing stays correct when the chromosome
        # is absent, whereas baseline[-integer(0), ] would drop every row.
        keep <- !(baseline$CHR %in% i.chr)
        target_set <- baseline$chr_bp[keep]

        for (i in seq_along(annotations)){
            # All variants carrying annotation i; calculate_OR_enrichment restricts
            # them to target_set, which removes the held-out chromosome
            pos <- which(baseline[[annotations[i]]] == 1)
            baseline.tmp <- baseline$chr_bp[pos]
            res <- calculate_OR_enrichment(baseline.tmp, your_anno, target_set = target_set)
            OR_blockJacknife[i.chr, i] <- res$OR
            Enrichment_blockJacknife[i.chr, i] <- res$Enrichment
        }
        print(paste("Processed chromosome", i.chr, "of 22"))
    }

    # Calculate final statistics
    print("Calculating final statistics...")
    OR <- colMeans(log2(OR_blockJacknife), na.rm = TRUE)
    Enrichment <- colMeans(Enrichment_blockJacknife, na.rm = TRUE)
    Enrichment_log2 <- colMeans(log2(Enrichment_blockJacknife), na.rm = TRUE)

    # Jackknife SE: sqrt((N - 1) / N * sum((theta_i - mean(theta))^2)) with N = 22,
    # written as var() * (N - 1)^2 / N because var() already divides by N - 1
    OR_sd <- Enrichment_sd <- OR_sd_log2 <- Enrichment_sd_log2 <- numeric(length(annotations))
    for (j in seq_along(annotations)){
        OR_sd[j] <- sqrt(var(OR_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
        Enrichment_sd[j] <- sqrt(var(Enrichment_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
        OR_sd_log2[j] <- sqrt(var(log2(OR_blockJacknife[,j]), na.rm = TRUE) * 21^2 / 22)
        Enrichment_sd_log2[j] <- sqrt(var(log2(Enrichment_blockJacknife[,j]), na.rm = TRUE) * 21^2 / 22)
    }

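    # Calculate Z-scores and p-values
    # Note: the raw-scale Z-score compares Enrichment with 0, whereas "no enrichment"
    # corresponds to Enrichment = 1; the log2-scale test has null value 0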
    Enrichment_z_scores <- Enrichment / Enrichment_sd
    Enrichment_p_values <- pchisq(Enrichment_z_scores^2, 1, lower.tail = FALSE)
    Enrichment_log2_z_scores <- Enrichment_log2 / Enrichment_sd_log2
    Enrichment_log2_p_values <- pchisq(Enrichment_log2_z_scores^2, 1, lower.tail = FALSE)

    # Create summary data frame
    summary_df <- data.frame(
        Annotation = annotations,
        OR = 2^OR,
        OR_SE = OR_sd,
        OR_log2 = OR,
        OR_SE_log2 = OR_sd_log2,
        Enrichment = Enrichment,
        Enrichment_SE = Enrichment_sd,
        Enrichment_log2 = Enrichment_log2,
        Enrichment_SE_log2 = Enrichment_sd_log2,
        Enrichment_Z_score = Enrichment_z_scores,
        Enrichment_P_value = Enrichment_p_values,
        Enrichment_log2_z_scores = Enrichment_log2_z_scores,
        Enrichment_log2_p_values = Enrichment_log2_p_values        
    )
    print("Summary data frame created.")

    # Prepare results
    results <- list(
        "summary" = summary_df,
        "OR_blockJacknife" = OR_blockJacknife,
        "Enrichment_blockJacknife" = Enrichment_blockJacknife,
        "OR" = OR,
        "Enrichment" = Enrichment,
        "OR_sd" = OR_sd,
        "Enrichment_sd" = Enrichment_sd,
        "Enrichment_Z_scores" = Enrichment_z_scores,
        "Enrichment_P_values" = Enrichment_p_values,                
        "annotations" = annotations
    )
    print("Results prepared.")


    # Save results
    saveRDS(results, '${_output['enrichment']}', compress='xz')
    print(paste("Results saved to:", '${_output['enrichment']}'))

    # Save summary table as TSV gz
    summary_tsv_path <- sub("\\.rds$", "_summary.tsv.gz", '${_output['enrichment']}')
    data.table::fwrite(summary_df, summary_tsv_path, sep="\t", quote=FALSE, compress="gzip")
    print(paste("Summary table saved to:", summary_tsv_path))

    # End timing
    end_time <- Sys.time()
    print(paste("Job ended at:", end_time))
    print(paste("Total time elapsed:", as.numeric(difftime(end_time, start_time, units = "mins")), "minutes"))
