Pathway Analysis#

Description#

Pathway enrichment analysis over groups of genes using both the KEGG database and all three Gene Ontology (GO) ontologies, via the clusterProfiler R package. It identifies the biologically significant pathways and functional categories associated with each gene group, and supports multi-group comparison (e.g. cell types, conditions, treatments) with a single unified output.

Methods#

Gene ID conversion. Input ENSEMBL gene IDs are converted to ENTREZ IDs using org.Hs.eg.db; genes without a valid ENTREZ mapping are dropped, and group associations are preserved through the conversion.

Enrichment testing. KEGG over-representation uses enrichKEGG() (organism hsa); GO uses enrichGO() against org.Hs.eg.db for Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). Both apply a hypergeometric test with Benjamini-Hochberg FDR correction.

Output standardization. Results from all analysis types are harmonized into one column schema with metadata columns (analysis type, ontology category) and a hierarchical category -> subcategory -> term classification, then combined into a single data frame and saved as a compressed RDS file with per-group summary statistics.

Input#

Gene List (TSV format)#

A table containing grouped genes with the following structure:

Required columns:

  • group: The group/category to which a gene belongs (e.g., AD_A_cauchy, TL1_C_cauchy)

  • gene_id: ENSEMBL gene IDs for the corresponding genes

Example:

group           gene_id
AD     ENSG00000139618
AD    ENSG00000091831
TL1    ENSG00000196839
TL1   ENSG0000008123

Parameters:#

  • pvalue_cutoff: Cutoff for pathway enrichment significance (default: 1)

  • organism: KEGG organism code (default: hsa for humans)


Output#

Combined Pathway Results File#

A compressed RDS file (.rds) containing standardized enrichment results from all analyses.

Key columns in the results:

  • ID: Pathway/GO term identifier

  • Description: Pathway/GO term name/description

  • GeneRatio: Ratio of input genes found in the pathway/term

  • BgRatio: Background ratio of all genes in the database for this pathway/term

  • pvalue: Raw enrichment p-value

  • p.adjust: Adjusted p-value (Benjamini-Hochberg)

  • qvalue: Q-value for FDR control

  • group: Name of the gene group analyzed

  • analysis_type: Type of analysis (KEGG or GO)

  • ont_category: Ontology category (PATHWAY, BP, CC, or MF)

  • category: High-level category classification

  • subcategory: Detailed subcategory classification


Steps#

Step 1. Run KEGG + GO over-representation (pathway) analysis on the grouped gene sets (pathway_analysis). Genes are mapped ENSEMBL→ENTREZ, then enrichKEGG/enrichGO are run per group and the results are combined into one RDS. Note: KEGG enrichment queries the online KEGG API, so this step needs network access; GO enrichment uses the local org.Hs.eg.db.

Timing: ~1-2 min on typical compute infrastructure.

sos run pipeline/gsea.ipynb pathway_analysis \
    --genes_file input/enrichment/protocol_example.pathway_genes.tsv \
    --name protocol_example \
    --pvalue_cutoff 1 --organism hsa \
    --cwd output/pathway_analysis

Anticipated Results#

Writes a single compressed .rds with standardized KEGG and GO (BP/CC/MF) enrichment results across all gene groups, sharing one column schema for downstream comparison. Each group yields the pathways and ontology terms over-represented among its genes.

Command interface#

sos run pipeline/gsea.ipynb -h

Workflow implementation#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')

parameter: genes_file = path()  # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway_results = f'{cwd:a}/{step_name}/{name}.combined_pathway_results.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    library(org.Hs.eg.db)
    library(AnnotationDbi)
    library(clusterProfiler)
    library(dplyr)

    # Parameters
    organism <- '${organism}'
    pvalue_cutoff <- ${pvalue_cutoff}

    # Read input data
    gene_data <- read.table('${genes_file}', header = TRUE, sep = "\t", stringsAsFactors = FALSE)

    # ID conversion function
    convert_ids <- function(gene_list) {
        entrez_ids <- mapIds(org.Hs.eg.db, 
                        keys = gene_list, 
                        column = "ENTREZID", 
                        keytype = "ENSEMBL")
        return(na.omit(unique(entrez_ids)))
    }

    # Standardize result columns
    standardize_columns <- function(result_df, analysis_type, ont_category, group_name) {
        if(is.null(result_df) || nrow(result_df) == 0) {
            return(data.frame())
        }
        
        result_df$group <- group_name
        result_df$analysis_type <- analysis_type
        result_df$ont_category <- ont_category
        
        # Add category and subcategory based on analysis type
        if(analysis_type == "KEGG") {
            result_df$category <- NA
            result_df$subcategory <- NA
        } else {
            result_df$category <- "Gene Ontology"
            result_df$subcategory <- switch(ont_category,
                "BP" = "Biological Process",
                "CC" = "Cellular Component", 
                "MF" = "Molecular Function",
                NA)
        }
        
        return(result_df)
    }

    # KEGG enrichment
    perform_kegg <- function(group_genes, group_name) {
        entrez_ids <- convert_ids(group_genes)
        if(length(entrez_ids) == 0) return(data.frame())
        
        enriched <- tryCatch({
            enrichKEGG(gene = entrez_ids, organism = organism, pvalueCutoff = pvalue_cutoff)
        }, error = function(e) NULL)
        
        if(is.null(enriched)) return(data.frame())
        
        result_df <- as.data.frame(enriched)
        return(standardize_columns(result_df, "KEGG", "PATHWAY", group_name))
    }

    # GO enrichment
    perform_go <- function(group_genes, group_name, ont_type) {
        entrez_ids <- convert_ids(group_genes)
        if(length(entrez_ids) == 0) return(data.frame())
        
        enriched <- tryCatch({
            enrichGO(gene = entrez_ids, OrgDb = org.Hs.eg.db, ont = ont_type, 
                    pvalueCutoff = pvalue_cutoff, readable = TRUE)
        }, error = function(e) NULL)
        
        if(is.null(enriched)) return(data.frame())
        
        result_df <- as.data.frame(enriched)
        return(standardize_columns(result_df, "GO", ont_type, group_name))
    }

    # Main analysis
    unique_groups <- unique(gene_data$group)
    all_results <- list()

    # Process all enrichments
    for(group_name in unique_groups) {
        group_genes <- gene_data$gene_id[gene_data$group == group_name]
        
        # KEGG
        kegg_result <- perform_kegg(group_genes, group_name)
        if(nrow(kegg_result) > 0) {
            all_results[[paste0(group_name, "_KEGG")]] <- kegg_result
        }
        
        # GO analyses
        for(ont_type in c("BP", "CC", "MF")) {
            go_result <- perform_go(group_genes, group_name, ont_type)
            if(nrow(go_result) > 0) {
                all_results[[paste0(group_name, "_GO_", ont_type)]] <- go_result
            }
        }
    }

    # Combine results
    if(length(all_results) > 0) {
        combined_results <- do.call(rbind, all_results)
        rownames(combined_results) <- NULL
    } else {
        # Create empty standardized dataframe
        combined_results <- data.frame(
            category = character(), subcategory = character(),
            ID = character(), Description = character(),
            GeneRatio = character(), BgRatio = character(),
            RichFactor = numeric(), FoldEnrichment = numeric(),
            zScore = numeric(), pvalue = numeric(),
            p.adjust = numeric(), qvalue = numeric(),
            geneID = character(), Count = integer(),
            group = character(), analysis_type = character(),
            ont_category = character(),
            stringsAsFactors = FALSE
        )
    }

    # Summary statistics
    print("Summary by group:")
    print(table(combined_results$group))

    print("Cross-tabulation by analysis type and ontology:")
    print(table(combined_results$analysis_type, combined_results$ont_category))

    print("First few rows of combined results:")
    print(head(combined_results, 3))

    # Save results
    saveRDS(combined_results, '${_output['pathway_results']}', compress='xz')
    print(paste("Results saved to:", '${_output['pathway_results']}'))