Pathway Analysis#
Description#
This workflow performs pathway enrichment analysis using the KEGG database and the clusterProfiler package in R.
1. Purpose#
This workflow performs pathway enrichment analysis using the KEGG database and the clusterProfiler
package in R. It identifies biologically significant KEGG pathways associated with groups of genes based on their enrichment in known pathways.
Key objectives:
Analyze biological significance of gene groups in a pathway context.
Perform enrichment analysis for different groups (e.g., cell types, conditions).
Generate a summarized output for downstream interpretation.
2. Input#
Gene List (TSV format): A table containing grouped genes.
Columns:
group
: The group/category to which a gene belongs (e.g.,Neuron
,Microglia
).gene_id
: ENSEMBL gene IDs for the corresponding genes.
Example:
group gene_id Neuron ENSG00000139618 Neuron ENSG00000091831 Microglia ENSG00000196839 Microglia ENSG00000081237
Parameters:
pvalue_cutoff
: Cutoff for pathway enrichment significance (default: 1).organism
: KEGG organism code (default:hsa
for humans).
3. Output#
Pathway Results File:
A compressed RDS file (
.rds
) containing enrichment results for all groups.Each group has its pathway results stored in a combined data frame.
Columns in the results:
ID
: KEGG pathway ID.Description
: KEGG pathway name.GeneRatio
: Ratio of genes in the group that are in the pathway.BgRatio
: Ratio of all genes in the database associated with the pathway.pvalue
: Enrichment p-value.qvalue
: Adjusted p-value.group
: Name of the gene group.
4. Workflow Steps#
Load Input Data:
Read the input TSV file containing grouped ENSEMBL gene IDs.
Gene ID Conversion:
Convert ENSEMBL gene IDs to ENTREZ IDs using
org.Hs.eg.db
.Omit genes without valid mappings.
Pathway Enrichment Analysis:
Perform KEGG pathway enrichment analysis using the
enrichKEGG
function for each group.Apply the specified
pvalue_cutoff
to filter results.
Combine Results:
Merge the enrichment results for all groups into a single data frame.
Save Output:
Save the combined results as a compressed RDS file for downstream analysis.
5. Methods#
Pathway Enrichment Analysis:
The workflow uses the
enrichKEGG
function from theclusterProfiler
package.Genes are mapped to KEGG pathways, and statistical tests are performed to identify significantly enriched pathways.
Adjusted p-values (
qvalue
) control for multiple testing.
Gene Conversion:
ENSEMBL IDs are converted to ENTREZ IDs using
org.Hs.eg.db
.Only successfully mapped genes are used for enrichment.
Output Integration:
Results are organized into a structured data frame with pathway enrichment statistics for each group.
Minimal Working Example Steps#
sos run /home/al4225/project/quantile_twas/analysis/pathway/pathway.ipynb pathway_analysis \
--genes_file /home/al4225/project/quantile_twas/analysis/pathway/test_data/157_genes_input.tsv \
--cwd /home/al4225/project/quantile_twas/analysis/pathway/output --name rosmap.157genes -s build
Command interface#
!sos run xqtl-protocol/code/enrichment/pathway.ipynb -h
usage: sos run D:/research_code/software/pecotmr_vqtl/xqtl-protocol/code/enrichment/pathway.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
pathway_analysis
Global Workflow Options:
--cwd output (as path)
Path to the work directory of the analysis.
--genes-file . (as path)
--name VAL (as str, required)
--numThreads 8 (as int)
Number of threads
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 12h
--mem 16G
--container ''
Container option for software to run the analysis:
docker or singularity
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
pathway_analysis:
Workflow Options:
--pvalue-cutoff 1 (as int)
--organism hsa
Workflow implementation#
[global]
# Path to the work directory of the analysis.
parameter: cwd = path('output')
parameter: genes_file = path() # TSV file with columns: group, gene_id
parameter: name = str
# Number of threads
parameter: numThreads = 8
# For cluster jobs, number commands to run per job
parameter: job_size = 1
parameter: walltime = '12h'
parameter: mem = '16G'
[pathway_analysis]
parameter: pvalue_cutoff = 1
parameter: organism = 'hsa'
output: pathway = f'{cwd:a}/{step_name}/{name}.pathway_results.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'
R: expand = '${ }', stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
library(org.Hs.eg.db)
library(AnnotationDbi)
library(clusterProfiler)
library(dplyr)
organism <- '${organism}'
print(organism)
pvalue_cutoff <- ${pvalue_cutoff}
print(pvalue_cutoff)
# Read input TSV file
gene_data <- read.table('${genes_file}',
header = TRUE,
sep = "\t",
stringsAsFactors = FALSE)
# Function to convert and perform enrichment for one group
process_gene_list <- function(group_genes, group_name) {
# Convert ENSEMBL to ENTREZ
entrez_ids <- mapIds(org.Hs.eg.db,
keys = group_genes,
column = "ENTREZID",
keytype = "ENSEMBL")
entrez_ids <- na.omit(unique(entrez_ids))
# Perform enrichment
enriched <- enrichKEGG(
gene = entrez_ids,
organism = organism,
pvalueCutoff = pvalue_cutoff
)
# Convert to dataframe and add group information
result_df <- as.data.frame(enriched)
if(nrow(result_df) > 0) {
result_df$group <- group_name
}
return(result_df)
}
# Process each group
unique_groups <- unique(gene_data$group)
all_results <- lapply(unique_groups, function(group_name) {
group_genes <- gene_data$gene_id[gene_data$group == group_name]
process_gene_list(group_genes, group_name)
})
# Combine results
combined_results <- do.call(rbind, all_results)
print('pathway combined_results')
print(head(combined_results))
# Save results
saveRDS(combined_results, '${_output['pathway']}', compress='xz')
print(paste("Results saved to:", '${_output['pathway']}'))