Pathway Analysis#

Description#

This workflow performs pathway enrichment analysis using the KEGG database and the clusterProfiler package in R.

1. Purpose#

This workflow performs pathway enrichment analysis using the KEGG database and the clusterProfiler package in R. It identifies biologically significant KEGG pathways associated with groups of genes based on their enrichment in known pathways.

Key objectives:

  • Analyze biological significance of gene groups in a pathway context.

  • Perform enrichment analysis for different groups (e.g., cell types, conditions).

  • Generate a summarized output for downstream interpretation.

2. Input#

  • Gene List (TSV format): A table containing grouped genes.

    • Columns:

      • group: The group/category to which a gene belongs (e.g., Neuron, Microglia).

      • gene_id: ENSEMBL gene IDs for the corresponding genes.

    • Example:

      group       gene_id
      Neuron      ENSG00000139618
      Neuron      ENSG00000091831
      Microglia   ENSG00000196839
      Microglia   ENSG00000081237
      
  • Parameters:

    • pvalue_cutoff: Cutoff for pathway enrichment significance (default: 1).

    • organism: KEGG organism code (default: hsa for humans).

3. Output#

  • Pathway Results File:

    • A compressed RDS file (.rds) containing enrichment results for all groups.

    • Each group has its pathway results stored in a combined data frame.

    • Columns in the results:

      • ID: KEGG pathway ID.

      • Description: KEGG pathway name.

      • GeneRatio: Ratio of genes in the group that are in the pathway.

      • BgRatio: Ratio of all genes in the database associated with the pathway.

      • pvalue: Enrichment p-value.

      • qvalue: Adjusted p-value.

      • group: Name of the gene group.

4. Workflow Steps#

  1. Load Input Data:

    • Read the input TSV file containing grouped ENSEMBL gene IDs.

  2. Gene ID Conversion:

    • Convert ENSEMBL gene IDs to ENTREZ IDs using org.Hs.eg.db.

    • Omit genes without valid mappings.

  3. Pathway Enrichment Analysis:

    • Perform KEGG pathway enrichment analysis using the enrichKEGG function for each group.

    • Apply the specified pvalue_cutoff to filter results.

  4. Combine Results:

    • Merge the enrichment results for all groups into a single data frame.

  5. Save Output:

    • Save the combined results as a compressed RDS file for downstream analysis.

5. Methods#

  • Pathway Enrichment Analysis:

    • The workflow uses the enrichKEGG function from the clusterProfiler package.

    • Genes are mapped to KEGG pathways, and statistical tests are performed to identify significantly enriched pathways.

    • Adjusted p-values (qvalue) control for multiple testing.

  • Gene Conversion:

    • ENSEMBL IDs are converted to ENTREZ IDs using org.Hs.eg.db.

    • Only successfully mapped genes are used for enrichment.

  • Output Integration:

    • Results are organized into a structured data frame with pathway enrichment statistics for each group.

Minimal Working Example Steps#

sos run /home/al4225/project/quantile_twas/analysis/pathway/pathway.ipynb pathway_analysis \
    --genes_file /home/al4225/project/quantile_twas/analysis/pathway/test_data/157_genes_input.tsv \
    --cwd /home/al4225/project/quantile_twas/analysis/pathway/output --name rosmap.157genes -s build

Command interface#

!sos run xqtl-protocol/code/enrichment/pathway.ipynb -h
usage: sos run D:/research_code/software/pecotmr_vqtl/xqtl-protocol/code/enrichment/pathway.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  pathway_analysis

Global Workflow Options:
  --cwd output (as path)
                        Path to the work directory of the analysis.
  --genes-file . (as path)
  --name VAL (as str, required)
  --numThreads 8 (as int)
                        Number of threads
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 12h
  --mem 16G
  --container ''
                        Container option for software to run the analysis:
                        docker or singularity
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""


Sections
  pathway_analysis:
    Workflow Options:
      --pvalue-cutoff 1 (as int)
      --organism hsa

Workflow implementation#

[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')

parameter: genes_file = path()  # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway = f'{cwd:a}/{step_name}/{name}.pathway_results.rds'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
    library(org.Hs.eg.db)
    library(AnnotationDbi)
    library(clusterProfiler)
    library(dplyr)
    organism <- '${organism}'
    print(organism)
    pvalue_cutoff <- ${pvalue_cutoff}
    print(pvalue_cutoff)
    # Read input TSV file
    gene_data <- read.table('${genes_file}', 
                           header = TRUE, 
                           sep = "\t", 
                           stringsAsFactors = FALSE)
    
    # Function to convert and perform enrichment for one group
    process_gene_list <- function(group_genes, group_name) {
        # Convert ENSEMBL to ENTREZ
        entrez_ids <- mapIds(org.Hs.eg.db, 
                           keys = group_genes, 
                           column = "ENTREZID", 
                           keytype = "ENSEMBL")
        entrez_ids <- na.omit(unique(entrez_ids))
        
        # Perform enrichment
        enriched <- enrichKEGG(
            gene = entrez_ids,
            organism = organism,
            pvalueCutoff = pvalue_cutoff
        )
        
        # Convert to dataframe and add group information
        result_df <- as.data.frame(enriched)
        if(nrow(result_df) > 0) {
            result_df$group <- group_name
        }
        return(result_df)
    }
    
    # Process each group
    unique_groups <- unique(gene_data$group)
    all_results <- lapply(unique_groups, function(group_name) {
        group_genes <- gene_data$gene_id[gene_data$group == group_name]
        process_gene_list(group_genes, group_name)
    })
    
    # Combine results
    combined_results <- do.call(rbind, all_results)
    print('pathway combined_results')
    print(head(combined_results))
    # Save results
    
    saveRDS(combined_results, '${_output['pathway']}', compress='xz')
    print(paste("Results saved to:", '${_output['pathway']}'))