Chromosome-Specific Enrichment Analysis of Annotations Using Block Jackknife#
Description#
This notebook performs a chromosome-specific enrichment analysis for genomic annotations using a block jackknife approach. It computes odds ratios (OR) and enrichment statistics for each annotation column by systematically leaving out one chromosome at a time and recalculating the statistics. The analysis provides insight into the overlap and significance of annotations in relation to significant variants within a genomic region.
Key features:
Helper function to compute odds ratios and enrichment.calculate_OR_enrichment
Chromosome-level leave-one-out cross-validation.
Summary statistics, including means and standard errors for OR and enrichment values.
The results are saved as an RDS file for downstream analysis.
Definitions and Test Statistics#
Odds Ratio (OR)#
The Odds Ratio (OR) quantifies the strength of association between significant variants and a specific annotation.
Formula: $\( OR = \frac{\left| AB \right| / \left| A \setminus B \right|}{\left| \text{noA-noB} \right| / \left| B \setminus A \right|} \)$
Where:
\(A\): The set of SNPs within the annotation.
\(B\): The set of significant SNPs.
\(AB\): The intersection of \(A\) and \(B\) (i.e., significant SNPs in the annotation).
\(A \setminus B\): SNPs in the annotation but not significant.
\(B \setminus A\): Significant SNPs not in the annotation.
\(\text{noA-noB}\): SNPs that are neither in the annotation nor significant.
Enrichment#
The Enrichment evaluates whether a genomic annotation contains a higher proportion of significant SNPs than expected by chance.
Formula: $\( \text{Enrichment} = \frac{\text{Proportion of significant SNPs in the annotation}}{\text{Proportion of all SNPs in the annotation}} \)$
Or equivalently: $\( \text{Enrichment} = \frac{\frac{\left| AB \right|}{\left| B \right|}}{\frac{\left| A \right|}{\left| \text{Target Set} \right|}} \)$
Where:
\(\left| AB \right|\): Significant SNPs in the annotation.
\(\left| B \right|\): Total number of significant SNPs.
\(\left| A \right|\): Total number of SNPs in the annotation.
\(\left| \text{Target Set} \right|\): Total number of SNPs in the genome or study region.
Standard Error (SE) Computation#
Leave-One-Chromosome-Out (LOCO) Jackknife#
The LOCO method estimates the standard error by removing one chromosome at a time and recomputing the test statistic, capturing variability due to genomic structure.
Steps:
For each chromosome \(i\):
Remove chromosome \(i\) from the dataset.
Compute \(OR_i\) and \(\text{Enrichment}_i\) using the remaining chromosomes.
Aggregate \(OR_i\) and \(\text{Enrichment}_i\) to compute the mean and SE.
SE Formula#
Using the LOCO estimates (\(\theta_i\) for each chromosome \(i\)): $\( \text{SE}(\theta) = \sqrt{\frac{\sum_{i=1}^{N} (\theta_i - \bar{\theta})^2}{N \cdot (N-1)}} \)$
Where:
\(\theta_i\): Statistic (\(OR_i\) or \(\text{Enrichment}_i\)) excluding chromosome \(i\).
\(\bar{\theta}\): Mean of \(\theta_i\) across chromosomes.
\(N\): Number of chromosomes (e.g., 22 for autosomes).
Computational Workflow#
Step 1: Odds Ratio and Enrichment Computation#
Compute \(OR\) and \(\text{Enrichment}\) using the formulas above.
Repeat for each chromosome using the LOCO approach.
Step 2: Aggregation#
Compute mean \(OR\) and \(\text{Enrichment}\) across chromosomes.
Estimate SE using the jackknife method.
Step 3: Summary Outputs#
Generate summary statistics for each annotation, including:
Mean \(OR\), SE of \(OR\).
Mean \(\text{Enrichment}\), SE of \(\text{Enrichment}\).
Input Format#
1. significant_variants_path
#
Format: RDS file containing significant variants.
Columns:
chr
: Chromosome number (integer).pos
: Genomic position (integer).
Example:
chr pos 1 12345 1 67890
2. baseline_anno_path
#
Format: RDS file containing a tabular data frame with baseline annotations.
Columns:
CHR: Chromosome number (integer).
BP: Genomic base pair position (integer).
SNP: SNP ID (character).
CM: Centimorgan position (numeric).
base: Base-level information (integer).
Annotation columns: Binary columns (0/1) for various genomic annotations (e.g.,
Coding_UCSC
,Conserved_LindbladToh
,CTCF_Hoffman
, etc.).Flanking regions: Binary columns indicating the presence of annotations within 500 bp of the main region (e.g.,
Coding_UCSC.flanking.500
,Conserved_LindbladToh.flanking.500
, etc.).Additional annotation metrics: Annotation statistics like species enhancer counts and promoter scores (
Human_Enhancer_Villar_Species_Enhancer_Count
,Human_Promoter_Villar_ExAC
, etc.).
Example:
CHR BP SNP CM base Coding_UCSC Coding_UCSC.flanking.500 ⋯ Human_Enhancer_Villar Human_Enhancer_Villar.flanking.500 1 11008 rs575272151 0 1 0 0 ⋯ 0 0 1 11012 rs544419019 0 1 0 0 ⋯ 0 0 1 13110 rs540538026 0 1 0 0 ⋯ 0 0 1 13116 rs62635286 0 1 0 0 ⋯ 0 0
Output Format#
1. enrichment_results.rds
#
Format: RDS file containing the following components:
summary
: A data frame summarizing the OR, OR_SE, Enrichment, and Enrichment_SE for each annotation column.Annotation OR OR_SE Enrichment Enrichment_SE Coding_UCSC 1.23 0.12 0.85 0.10 Conserved_LindbladToh 0.98 0.08 1.12 0.05 Human_Enhancer_Villar 1.45 0.15 1.30 0.12
OR_blockJacknife
: A matrix (22 rows for chromosomes × annotation columns) of log2-transformed OR values.Coding_UCSC Conserved_LindbladToh Human_Enhancer_Villar 0.12 -0.02 0.25 0.15 0.01 0.18 ⋮ ⋮ ⋮
Enrichment_blockJacknife
: A matrix (22 rows for chromosomes × annotation columns) of enrichment values.OR
: A numeric vector of mean log2-transformed OR values across chromosomes for each annotation column.Enrichment
: A numeric vector of mean enrichment values across chromosomes for each annotation column.OR_sd
: A numeric vector of standard errors for OR values across chromosomes for each annotation column.Enrichment_sd
: A numeric vector of standard errors for enrichment values across chromosomes for each annotation column.annotations
: A list of annotation column names.
Minimal Working Example Steps#
sos run xqtl-protocol/pipeline/functional_enrichment.ipynb enrichment\
--significant_variants_path /path/to/significant_variants.rds \
--baseline_anno_path ./Anno_Enrichment/Baseline_anno_each.rds \
--cwd output
Command interface#
sos run functional_enrichment.ipynb -h
Workflow implementation#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
parameter: significant_variants_path = path
parameter: baseline_anno_path = path
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[enrichment]
parameter: annotations_start = 7
input: significant_variants_path, baseline_anno_path
output: enrichment = f'{cwd:a}/{step_name}/{name}.enrichment_results.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
library(tidyverse)
# Helper function for enrichment calculation
calculate_OR_enrichment <- function(set1, set2, target_set = NULL){
if (is.null(target_set)){
target_set <- unique(union(set1, set2))
}
A <- intersect(set1, target_set)
B <- intersect(set2, target_set)
AB <- intersect(A, B)
AnoB <- setdiff(A, AB)
noAB <- setdiff(B, AB)
noAnoB <- setdiff(target_set, c(A,B))
if (length(noAB) == 0 || length(AnoB) == 0) {
OR <- Enrichment <- 1
} else {
OR <- (length(AB) / length(AnoB)) * (length(noAnoB) / length(noAB))
Den <- length(A) / length(target_set)
Num <- length(AB) / length(B)
Enrichment <- Num / Den
}
return(list("OR" = OR,
"Enrichment" = Enrichment))
}
# Start timing
start_time <- Sys.time()
print(paste("Job started at:", start_time))
# Load input data
print("Loading input data...")
coloc_anno <- readRDS("${significant_variants_path}")
baseline <- readRDS("${baseline_anno_path}")
print("Data loaded successfully!")
# Process significant variants
coloc_anno <- sapply(1:nrow(coloc_anno), function(i) {
a <- coloc_anno[i,]
if (is.numeric(a$chr) || grepl("^[0-9]+$", a$chr)) {
paste0("chr", a$chr, ":", a$pos)
} else {
paste0(a$chr, ":", a$pos)
}
})
print("Processed significant variants.")
# Process baseline annotation
baseline <- baseline %>%
mutate(chr_bp = paste0("chr", CHR, ":", BP))%>%
relocate(chr_bp, .before = 1)
print("Processed baseline annotation.")
# Get annotation columns
annotations_start = ${annotations_start}
annotations <- colnames(baseline)[annotations_start:ncol(baseline)]
print(paste("Number of annotations:", length(annotations)))
# Initialize matrices for results
OR_blockJacknife <- Enrichment_blockJacknife <- matrix(NA,
nrow = 22,
ncol = length(annotations))
colnames(OR_blockJacknife) <- colnames(Enrichment_blockJacknife) <- annotations
# Perform leave-one-chromosome-out analysis
print("Starting LOCO analysis...")
for (i.chr in 1:22){
chr <- i.chr
pp <- which(baseline$CHR == chr)
baseline.jk <- baseline[-pp,]
target_set <- baseline.jk$chr_bp
for (i in 1:length(annotations)){
anno <- baseline %>% select(annotations[i])
pos <- which(anno == 1)
baseline.tmp <- baseline$chr_bp[pos]
res <- calculate_OR_enrichment(baseline.tmp, coloc_anno, target_set = target_set)
OR_blockJacknife[i.chr, i] <- res$OR
Enrichment_blockJacknife[i.chr, i] <- res$Enrichment
}
print(paste("Processed chromosome", i.chr, "of 22"))
}
# Calculate final statistics
print("Calculating final statistics...")
OR <- colMeans(log2(OR_blockJacknife), na.rm = TRUE)
Enrichment <- colMeans(Enrichment_blockJacknife, na.rm = TRUE)
OR_sd <- Enrichment_sd <- numeric(length(annotations))
for (j in 1:length(annotations)){
OR_sd[j] <- sqrt(var(OR_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
Enrichment_sd[j] <- sqrt(var(Enrichment_blockJacknife[,j], na.rm = TRUE) * 21^2 / 22)
}
# Create summary data frame
summary_df <- data.frame(
Annotation = annotations,
OR = exp(OR),
OR_SE = OR_sd,
Enrichment = Enrichment,
Enrichment_SE = Enrichment_sd
)
print("Summary data frame created.")
# Prepare results
results <- list(
"summary" = summary_df,
"OR_blockJacknife" = OR_blockJacknife,
"Enrichment_blockJacknife" = Enrichment_blockJacknife,
"OR" = OR,
"Enrichment" = Enrichment,
"OR_sd" = OR_sd,
"Enrichment_sd" = Enrichment_sd,
"annotations" = annotations
)
print("Results prepared.")
# Save results
saveRDS(results, '${_output['enrichment']}', compress='xz')
print(paste("Results saved to:", '${_output['enrichment']}'))
# End timing
end_time <- Sys.time()
print(paste("Job ended at:", end_time))
print(paste("Total time elapsed:", as.numeric(difftime(end_time, start_time, units = "mins")), "minutes"))