Pathway Analysis#
Description#
Pathway enrichment analysis over groups of genes using both the KEGG database and all three Gene Ontology (GO) ontologies, via the clusterProfiler R package. It identifies the biologically significant pathways and functional categories associated with each gene group, and supports multi-group comparison (e.g. cell types, conditions, treatments) with a single unified output.
Methods#
Gene ID conversion. Input ENSEMBL gene IDs are converted to ENTREZ IDs using org.Hs.eg.db; genes without a valid ENTREZ mapping are dropped, and group associations are preserved through the conversion.
Enrichment testing. KEGG over-representation uses enrichKEGG() (organism hsa); GO uses enrichGO() against org.Hs.eg.db for Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). Both apply a hypergeometric test with Benjamini-Hochberg FDR correction.
Output standardization. Results from all analysis types are harmonized into one column schema with metadata columns (analysis type, ontology category) and a hierarchical category -> subcategory -> term classification, then combined into a single data frame and saved as a compressed RDS file with per-group summary statistics.
Input#
Gene List (TSV format)#
A table containing grouped genes with the following structure:
Required columns:
group: The group/category to which a gene belongs (e.g.,AD_A_cauchy,TL1_C_cauchy)gene_id: ENSEMBL gene IDs for the corresponding genes
Example:
group gene_id
AD ENSG00000139618
AD ENSG00000091831
TL1 ENSG00000196839
TL1 ENSG0000008123
Parameters:#
pvalue_cutoff: Cutoff for pathway enrichment significance (default: 1)organism: KEGG organism code (default:hsafor humans)
Output#
Combined Pathway Results File#
A compressed RDS file (.rds) containing standardized enrichment results from all analyses.
Key columns in the results:
ID: Pathway/GO term identifierDescription: Pathway/GO term name/descriptionGeneRatio: Ratio of input genes found in the pathway/termBgRatio: Background ratio of all genes in the database for this pathway/termpvalue: Raw enrichment p-valuep.adjust: Adjusted p-value (Benjamini-Hochberg)qvalue: Q-value for FDR controlgroup: Name of the gene group analyzedanalysis_type: Type of analysis (KEGGorGO)ont_category: Ontology category (PATHWAY,BP,CC, orMF)category: High-level category classificationsubcategory: Detailed subcategory classification
Steps#
Step 1. Run KEGG + GO over-representation (pathway) analysis on the grouped gene sets (pathway_analysis). Genes are mapped ENSEMBL→ENTREZ, then enrichKEGG/enrichGO are run per group and the results are combined into one RDS. Note: KEGG enrichment queries the online KEGG API, so this step needs network access; GO enrichment uses the local org.Hs.eg.db.
Timing: ~1-2 min on typical compute infrastructure.
sos run pipeline/gsea.ipynb pathway_analysis \
--genes_file input/enrichment/protocol_example.pathway_genes.tsv \
--name protocol_example \
--pvalue_cutoff 1 --organism hsa \
--cwd output/pathway_analysis
Anticipated Results#
Writes a single compressed .rds with standardized KEGG and GO (BP/CC/MF) enrichment results across all gene groups, sharing one column schema for downstream comparison. Each group yields the pathways and ontology terms over-represented among its genes.
Command interface#
sos run pipeline/gsea.ipynb -h
Workflow implementation#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
parameter: genes_file = path() # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway_results = f'{cwd:a}/{step_name}/{name}.combined_pathway_results.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
library(org.Hs.eg.db)
library(AnnotationDbi)
library(clusterProfiler)
library(dplyr)
# Parameters
organism <- '${organism}'
pvalue_cutoff <- ${pvalue_cutoff}
# Read input data
gene_data <- read.table('${genes_file}', header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# ID conversion function
convert_ids <- function(gene_list) {
entrez_ids <- mapIds(org.Hs.eg.db,
keys = gene_list,
column = "ENTREZID",
keytype = "ENSEMBL")
return(na.omit(unique(entrez_ids)))
}
# Standardize result columns
standardize_columns <- function(result_df, analysis_type, ont_category, group_name) {
if(is.null(result_df) || nrow(result_df) == 0) {
return(data.frame())
}
result_df$group <- group_name
result_df$analysis_type <- analysis_type
result_df$ont_category <- ont_category
# Add category and subcategory based on analysis type
if(analysis_type == "KEGG") {
result_df$category <- NA
result_df$subcategory <- NA
} else {
result_df$category <- "Gene Ontology"
result_df$subcategory <- switch(ont_category,
"BP" = "Biological Process",
"CC" = "Cellular Component",
"MF" = "Molecular Function",
NA)
}
return(result_df)
}
# KEGG enrichment
perform_kegg <- function(group_genes, group_name) {
entrez_ids <- convert_ids(group_genes)
if(length(entrez_ids) == 0) return(data.frame())
enriched <- tryCatch({
enrichKEGG(gene = entrez_ids, organism = organism, pvalueCutoff = pvalue_cutoff)
}, error = function(e) NULL)
if(is.null(enriched)) return(data.frame())
result_df <- as.data.frame(enriched)
return(standardize_columns(result_df, "KEGG", "PATHWAY", group_name))
}
# GO enrichment
perform_go <- function(group_genes, group_name, ont_type) {
entrez_ids <- convert_ids(group_genes)
if(length(entrez_ids) == 0) return(data.frame())
enriched <- tryCatch({
enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = ont_type,
pvalueCutoff = pvalue_cutoff, readable = TRUE)
}, error = function(e) NULL)
if(is.null(enriched)) return(data.frame())
result_df <- as.data.frame(enriched)
return(standardize_columns(result_df, "GO", ont_type, group_name))
}
# Main analysis
unique_groups <- unique(gene_data$group)
all_results <- list()
# Process all enrichments
for(group_name in unique_groups) {
group_genes <- gene_data$gene_id[gene_data$group == group_name]
# KEGG
kegg_result <- perform_kegg(group_genes, group_name)
if(nrow(kegg_result) > 0) {
all_results[[paste0(group_name, "_KEGG")]] <- kegg_result
}
# GO analyses
for(ont_type in c("BP", "CC", "MF")) {
go_result <- perform_go(group_genes, group_name, ont_type)
if(nrow(go_result) > 0) {
all_results[[paste0(group_name, "_GO_", ont_type)]] <- go_result
}
}
}
# Combine results
if(length(all_results) > 0) {
combined_results <- do.call(rbind, all_results)
rownames(combined_results) <- NULL
} else {
# Create empty standardized dataframe
combined_results <- data.frame(
category = character(), subcategory = character(),
ID = character(), Description = character(),
GeneRatio = character(), BgRatio = character(),
RichFactor = numeric(), FoldEnrichment = numeric(),
zScore = numeric(), pvalue = numeric(),
p.adjust = numeric(), qvalue = numeric(),
geneID = character(), Count = integer(),
group = character(), analysis_type = character(),
ont_category = character(),
stringsAsFactors = FALSE
)
}
# Summary statistics
print("Summary by group:")
print(table(combined_results$group))
print("Cross-tabulation by analysis type and ontology:")
print(table(combined_results$analysis_type, combined_results$ont_category))
print("First few rows of combined results:")
print(head(combined_results, 3))
# Save results
saveRDS(combined_results, '${_output['pathway_results']}', compress='xz')
print(paste("Results saved to:", '${_output['pathway_results']}'))