Pathway Analysis#

1. Purpose#

This workflow performs comprehensive pathway enrichment analysis using both the KEGG database and Gene Ontology (GO) annotations through the clusterProfiler package in R. It identifies biologically significant pathways and functional categories associated with groups of genes based on their enrichment patterns.

Key objectives:

  • Analyze biological significance of gene groups across multiple pathway databases

  • Perform enrichment analysis for KEGG pathways and all three GO ontologies

  • Generate unified, standardized output for downstream interpretation and comparison

  • Support multi-group comparative analysis (e.g., cell types, conditions, treatments)


2. Input#

Gene List (TSV format)#

A table containing grouped genes with the following structure:

Required columns:

  • group: The group/category to which a gene belongs (e.g., AD_A_cauchy, TL1_C_cauchy)

  • gene_id: ENSEMBL gene IDs for the corresponding genes

Example:

group           gene_id
AD     ENSG00000139618
AD    ENSG00000091831
TL1    ENSG00000196839
TL1   ENSG0000008123

Parameters:#

  • pvalue_cutoff: Cutoff for pathway enrichment significance (default: 1)

  • organism: KEGG organism code (default: hsa for humans)


3. Output#

Combined Pathway Results File#

A compressed RDS file (.rds) containing standardized enrichment results from all analyses.

Key columns in the results:

  • ID: Pathway/GO term identifier

  • Description: Pathway/GO term name/description

  • GeneRatio: Ratio of input genes found in the pathway/term

  • BgRatio: Background ratio of all genes in the database for this pathway/term

  • pvalue: Raw enrichment p-value

  • p.adjust: Adjusted p-value (Benjamini-Hochberg)

  • qvalue: Q-value for FDR control

  • group: Name of the gene group analyzed

  • analysis_type: Type of analysis (KEGG or GO)

  • ont_category: Ontology category (PATHWAY, BP, CC, or MF)

  • category: High-level category classification

  • subcategory: Detailed subcategory classification


4. Workflow Steps#

4.1 Data Preparation#

  • Load input TSV file containing grouped ENSEMBL gene IDs

  • Extract unique groups for systematic processing

4.2 Gene ID Conversion#

  • Convert ENSEMBL gene IDs to ENTREZ IDs using org.Hs.eg.db

  • Filter out genes without valid ENTREZ mappings

  • Maintain group associations throughout conversion

4.3 Multi-Database Enrichment Analysis#

KEGG Pathway Analysis:

  • Perform KEGG pathway enrichment using enrichKEGG function

  • Identify enriched metabolic and signaling pathways

Gene Ontology Analysis:

  • Biological Process (BP): Analyze enriched biological processes

  • Cellular Component (CC): Identify enriched cellular localizations

  • Molecular Function (MF): Examine enriched molecular activities

4.4 Result Standardization#

  • Harmonize column structures across all analysis types

  • Add metadata columns for analysis type and ontology category

  • Ensure consistent data formatting for downstream analysis

4.5 Data Integration and Output#

  • Combine all enrichment results into a unified data frame

  • Generate summary statistics by group and analysis type

  • Save results as compressed RDS file


5. Methods#

Enrichment Analysis Approach#

  • KEGG Analysis: Uses enrichKEGG() from clusterProfiler to test for pathway over-representation

  • GO Analysis: Uses enrichGO() with human annotation database (org.Hs.eg.db)

  • Statistical Testing: Hypergeometric test with multiple testing correction

  • Multiple Testing Correction: Benjamini-Hochberg FDR adjustment

Gene Mapping Strategy#

  • Primary conversion: ENSEMBL → ENTREZ using org.Hs.eg.db

  • Quality control: Remove unmapped genes to ensure analysis accuracy

  • Coverage tracking: Monitor conversion success rates per group

Output Standardization#

  • Unified column schema across all analysis types

  • Consistent metadata annotation for easy filtering and subsetting

  • Hierarchical classification system (category → subcategory → specific term)

Minimal Working Example Steps#

sos run pipeline/gsea.ipynb pathway_analysis \
    --genes_file data/pathway/test_pathway_genes_input.tsv \
    --cwd output/pathway_results --name test_genes

Command interface#

!sos run xqtl-protocol/code/enrichment/pathway.ipynb -h
usage: sos run D:/research_code/software/pecotmr_vqtl/xqtl-protocol/code/enrichment/pathway.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  pathway_analysis

Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --genes-file . (as path)
  --name VAL (as str, required)
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 12h
  --mem 16G
  --container ''
                        Container option for software to run the analysis:
                        docker or singularity
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  pathway_analysis:
    Workflow Options:
      --pvalue-cutoff 1 (as int)
      --organism hsa

Workflow implementation#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')

parameter: genes_file = path()  # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway_results = f'{cwd:a}/{step_name}/{name}.combined_pathway_results.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    library(org.Hs.eg.db)
    library(AnnotationDbi)
    library(clusterProfiler)
    library(dplyr)

    # Parameters
    organism <- '${organism}'
    pvalue_cutoff <- ${pvalue_cutoff}

    # Read input data
    gene_data <- read.table('${genes_file}', header = TRUE, sep = "\t", stringsAsFactors = FALSE)

    # ID conversion function
    convert_ids <- function(gene_list) {
        entrez_ids <- mapIds(org.Hs.eg.db, 
                        keys = gene_list, 
                        column = "ENTREZID", 
                        keytype = "ENSEMBL")
        return(na.omit(unique(entrez_ids)))
    }

    # Standardize result columns
    standardize_columns <- function(result_df, analysis_type, ont_category, group_name) {
        if(is.null(result_df) || nrow(result_df) == 0) {
            return(data.frame())
        }
        
        result_df$group <- group_name
        result_df$analysis_type <- analysis_type
        result_df$ont_category <- ont_category
        
        # Add category and subcategory based on analysis type
        if(analysis_type == "KEGG") {
            result_df$category <- NA
            result_df$subcategory <- NA
        } else {
            result_df$category <- "Gene Ontology"
            result_df$subcategory <- switch(ont_category,
                "BP" = "Biological Process",
                "CC" = "Cellular Component", 
                "MF" = "Molecular Function",
                NA)
        }
        
        return(result_df)
    }

    # KEGG enrichment
    perform_kegg <- function(group_genes, group_name) {
        entrez_ids <- convert_ids(group_genes)
        if(length(entrez_ids) == 0) return(data.frame())
        
        enriched <- tryCatch({
            enrichKEGG(gene = entrez_ids, organism = organism, pvalueCutoff = pvalue_cutoff)
        }, error = function(e) NULL)
        
        if(is.null(enriched)) return(data.frame())
        
        result_df <- as.data.frame(enriched)
        return(standardize_columns(result_df, "KEGG", "PATHWAY", group_name))
    }

    # GO enrichment
    perform_go <- function(group_genes, group_name, ont_type) {
        entrez_ids <- convert_ids(group_genes)
        if(length(entrez_ids) == 0) return(data.frame())
        
        enriched <- tryCatch({
            enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = ont_type, 
                    pvalueCutoff = pvalue_cutoff, readable = TRUE)
        }, error = function(e) NULL)
        
        if(is.null(enriched)) return(data.frame())
        
        result_df <- as.data.frame(enriched)
        return(standardize_columns(result_df, "GO", ont_type, group_name))
    }

    # Main analysis
    unique_groups <- unique(gene_data$group)
    all_results <- list()

    # Process all enrichments
    for(group_name in unique_groups) {
        group_genes <- gene_data$gene_id[gene_data$group == group_name]
        
        # KEGG
        kegg_result <- perform_kegg(group_genes, group_name)
        if(nrow(kegg_result) > 0) {
            all_results[[paste0(group_name, "_KEGG")]] <- kegg_result
        }
        
        # GO analyses
        for(ont_type in c("BP", "CC", "MF")) {
            go_result <- perform_go(group_genes, group_name, ont_type)
            if(nrow(go_result) > 0) {
                all_results[[paste0(group_name, "_GO_", ont_type)]] <- go_result
            }
        }
    }

    # Combine results
    if(length(all_results) > 0) {
        combined_results <- do.call(rbind, all_results)
        rownames(combined_results) <- NULL
    } else {
        # Create empty standardized dataframe
        combined_results <- data.frame(
            category = character(), subcategory = character(),
            ID = character(), Description = character(),
            GeneRatio = character(), BgRatio = character(),
            RichFactor = numeric(), FoldEnrichment = numeric(),
            zScore = numeric(), pvalue = numeric(),
            p.adjust = numeric(), qvalue = numeric(),
            geneID = character(), Count = integer(),
            group = character(), analysis_type = character(),
            ont_category = character(),
            stringsAsFactors = FALSE
        )
    }

    # Summary statistics
    print("Summary by group:")
    print(table(combined_results$group))

    print("Cross-tabulation by analysis type and ontology:")
    print(table(combined_results$analysis_type, combined_results$ont_category))

    print("First few rows of combined results:")
    print(head(combined_results, 3))

    # Save results
    saveRDS(combined_results, '${_output['pathway_results']}', compress='xz')
    print(paste("Results saved to:", '${_output['pathway_results']}'))