Pathway Analysis#
1. Purpose#
This workflow performs comprehensive pathway enrichment analysis using both the KEGG database and Gene Ontology (GO) annotations through the clusterProfiler package in R. It identifies biologically significant pathways and functional categories associated with groups of genes based on their enrichment patterns.
Key objectives:
Analyze biological significance of gene groups across multiple pathway databases
Perform enrichment analysis for KEGG pathways and all three GO ontologies
Generate unified, standardized output for downstream interpretation and comparison
Support multi-group comparative analysis (e.g., cell types, conditions, treatments)
2. Input#
Gene List (TSV format)#
A table containing grouped genes with the following structure:
Required columns:
group: The group/category to which a gene belongs (e.g.,AD_A_cauchy,TL1_C_cauchy)gene_id: ENSEMBL gene IDs for the corresponding genes
Example:
group gene_id
AD ENSG00000139618
AD ENSG00000091831
TL1 ENSG00000196839
TL1 ENSG0000008123
Parameters:#
pvalue_cutoff: Cutoff for pathway enrichment significance (default: 1)organism: KEGG organism code (default:hsafor humans)
3. Output#
Combined Pathway Results File#
A compressed RDS file (.rds) containing standardized enrichment results from all analyses.
Key columns in the results:
ID: Pathway/GO term identifierDescription: Pathway/GO term name/descriptionGeneRatio: Ratio of input genes found in the pathway/termBgRatio: Background ratio of all genes in the database for this pathway/termpvalue: Raw enrichment p-valuep.adjust: Adjusted p-value (Benjamini-Hochberg)qvalue: Q-value for FDR controlgroup: Name of the gene group analyzedanalysis_type: Type of analysis (KEGGorGO)ont_category: Ontology category (PATHWAY,BP,CC, orMF)category: High-level category classificationsubcategory: Detailed subcategory classification
4. Workflow Steps#
4.1 Data Preparation#
Load input TSV file containing grouped ENSEMBL gene IDs
Extract unique groups for systematic processing
4.2 Gene ID Conversion#
Convert ENSEMBL gene IDs to ENTREZ IDs using
org.Hs.eg.dbFilter out genes without valid ENTREZ mappings
Maintain group associations throughout conversion
4.3 Multi-Database Enrichment Analysis#
KEGG Pathway Analysis:
Perform KEGG pathway enrichment using
enrichKEGGfunctionIdentify enriched metabolic and signaling pathways
Gene Ontology Analysis:
Biological Process (BP): Analyze enriched biological processes
Cellular Component (CC): Identify enriched cellular localizations
Molecular Function (MF): Examine enriched molecular activities
4.4 Result Standardization#
Harmonize column structures across all analysis types
Add metadata columns for analysis type and ontology category
Ensure consistent data formatting for downstream analysis
4.5 Data Integration and Output#
Combine all enrichment results into a unified data frame
Generate summary statistics by group and analysis type
Save results as compressed RDS file
5. Methods#
Enrichment Analysis Approach#
KEGG Analysis: Uses
enrichKEGG()fromclusterProfilerto test for pathway over-representationGO Analysis: Uses
enrichGO()with human annotation database (org.Hs.eg.db)Statistical Testing: Hypergeometric test with multiple testing correction
Multiple Testing Correction: Benjamini-Hochberg FDR adjustment
Gene Mapping Strategy#
Primary conversion: ENSEMBL → ENTREZ using
org.Hs.eg.dbQuality control: Remove unmapped genes to ensure analysis accuracy
Coverage tracking: Monitor conversion success rates per group
Output Standardization#
Unified column schema across all analysis types
Consistent metadata annotation for easy filtering and subsetting
Hierarchical classification system (category → subcategory → specific term)
Minimal Working Example Steps#
sos run pipeline/gsea.ipynb pathway_analysis \
--genes_file data/pathway/test_pathway_genes_input.tsv \
--cwd output/pathway_results --name test_genes
Command interface#
!sos run xqtl-protocol/code/enrichment/pathway.ipynb -h
usage: sos run D:/research_code/software/pecotmr_vqtl/xqtl-protocol/code/enrichment/pathway.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
pathway_analysis
Global Workflow Options:
--cwd output (as path)
Path to the work directory of the analysis.
--genes-file . (as path)
--name VAL (as str, required)
--numThreads 8 (as int)
Number of threads
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 12h
--mem 16G
--container ''
Container option for software to run the analysis:
docker or singularity
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
pathway_analysis:
Workflow Options:
--pvalue-cutoff 1 (as int)
--organism hsa
Workflow implementation#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
parameter: genes_file = path() # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway_results = f'{cwd:a}/{step_name}/{name}.combined_pathway_results.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
library(org.Hs.eg.db)
library(AnnotationDbi)
library(clusterProfiler)
library(dplyr)
# Parameters
organism <- '${organism}'
pvalue_cutoff <- ${pvalue_cutoff}
# Read input data
gene_data <- read.table('${genes_file}', header = TRUE, sep = "\t", stringsAsFactors = FALSE)
# ID conversion function
convert_ids <- function(gene_list) {
entrez_ids <- mapIds(org.Hs.eg.db,
keys = gene_list,
column = "ENTREZID",
keytype = "ENSEMBL")
return(na.omit(unique(entrez_ids)))
}
# Standardize result columns
standardize_columns <- function(result_df, analysis_type, ont_category, group_name) {
if(is.null(result_df) || nrow(result_df) == 0) {
return(data.frame())
}
result_df$group <- group_name
result_df$analysis_type <- analysis_type
result_df$ont_category <- ont_category
# Add category and subcategory based on analysis type
if(analysis_type == "KEGG") {
result_df$category <- NA
result_df$subcategory <- NA
} else {
result_df$category <- "Gene Ontology"
result_df$subcategory <- switch(ont_category,
"BP" = "Biological Process",
"CC" = "Cellular Component",
"MF" = "Molecular Function",
NA)
}
return(result_df)
}
# KEGG enrichment
perform_kegg <- function(group_genes, group_name) {
entrez_ids <- convert_ids(group_genes)
if(length(entrez_ids) == 0) return(data.frame())
enriched <- tryCatch({
enrichKEGG(gene = entrez_ids, organism = organism, pvalueCutoff = pvalue_cutoff)
}, error = function(e) NULL)
if(is.null(enriched)) return(data.frame())
result_df <- as.data.frame(enriched)
return(standardize_columns(result_df, "KEGG", "PATHWAY", group_name))
}
# GO enrichment
perform_go <- function(group_genes, group_name, ont_type) {
entrez_ids <- convert_ids(group_genes)
if(length(entrez_ids) == 0) return(data.frame())
enriched <- tryCatch({
enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = ont_type,
pvalueCutoff = pvalue_cutoff, readable = TRUE)
}, error = function(e) NULL)
if(is.null(enriched)) return(data.frame())
result_df <- as.data.frame(enriched)
return(standardize_columns(result_df, "GO", ont_type, group_name))
}
# Main analysis
unique_groups <- unique(gene_data$group)
all_results <- list()
# Process all enrichments
for(group_name in unique_groups) {
group_genes <- gene_data$gene_id[gene_data$group == group_name]
# KEGG
kegg_result <- perform_kegg(group_genes, group_name)
if(nrow(kegg_result) > 0) {
all_results[[paste0(group_name, "_KEGG")]] <- kegg_result
}
# GO analyses
for(ont_type in c("BP", "CC", "MF")) {
go_result <- perform_go(group_genes, group_name, ont_type)
if(nrow(go_result) > 0) {
all_results[[paste0(group_name, "_GO_", ont_type)]] <- go_result
}
}
}
# Combine results
if(length(all_results) > 0) {
combined_results <- do.call(rbind, all_results)
rownames(combined_results) <- NULL
} else {
# Create empty standardized dataframe
combined_results <- data.frame(
category = character(), subcategory = character(),
ID = character(), Description = character(),
GeneRatio = character(), BgRatio = character(),
RichFactor = numeric(), FoldEnrichment = numeric(),
zScore = numeric(), pvalue = numeric(),
p.adjust = numeric(), qvalue = numeric(),
geneID = character(), Count = integer(),
group = character(), analysis_type = character(),
ont_category = character(),
stringsAsFactors = FALSE
)
}
# Summary statistics
print("Summary by group:")
print(table(combined_results$group))
print("Cross-tabulation by analysis type and ontology:")
print(table(combined_results$analysis_type, combined_results$ont_category))
print("First few rows of combined results:")
print(head(combined_results, 3))
# Save results
saveRDS(combined_results, '${_output['pathway_results']}', compress='xz')
print(paste("Results saved to:", '${_output['pathway_results']}'))