scEEMS Model Training Tutorial#
Author: Katie Cho (katiecho007@gmail.com), with input from Angelina Siagailo, Chirag Lakhani, and Gao Wang
This notebook implements the scEEMS (single-cell Enhanced Expression Modifier Scores) training methodology as outlined in the manuscript. scEEMS is a machine learning framework that predicts whether a genetic variant is a causal cell-type-specific eQTL as a function of 4,839 variant, gene, and variant-gene pair features. The models are trained on fine-mapped single-cell eQTL data from six brain cell types (astrocytes, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, and oligodendrocyte precursor cells) from the ROSMAP cohort, with applications to identifying functional variants relevant to Alzheimer’s disease.
Code Repo#
The scEEMS code repository can be found at daklab/scEEMS. This repository contains instructions for downloading the training and test data in order to train the scEEMS models. It also contains code for how the SHAP analysis, LD Score regression, eMAGMA analysis, and functionally-informed eQTL fine-mapping were conducted.
Motivation#
The “Missing Regulation” Problem#
Most disease-associated GWAS variants lie in non-coding regions of the genome, where they likely modulate gene expression. However, bulk-tissue eQTL studies fail to explain the majority of these variants, a phenomenon termed “missing regulation” (Connally et al., 2022). This gap exists because there are systematic differences between variants identified in eQTL studies versus disease GWAS (Mostafavi et al., 2022):
eQTLs are enriched in promoter regions and affect genes under weaker selective constraint
GWAS variants are enriched in distal enhancer regions and affect genes under stronger selective constraint
Understanding how non-coding GWAS variants modulate gene expression is critical for uncovering disease mechanisms, but several challenges limit our ability to make these connections:
Cell-type specificity: Bulk tissue approaches cannot capture regulatory effects that occur in specific cell types, particularly rare but disease-relevant populations like microglia in Alzheimer’s disease
Enhancer variants: Disease-associated variants in distal enhancers often have weaker eQTL signals that fail to reach statistical significance, especially in underpowered single-cell studies
Limited sample sizes: Single-cell eQTL mapping has reduced statistical power compared to bulk studies, making it difficult to detect true regulatory signals in rare cell types
scEEMS Solution#
scEEMS addresses these challenges by predicting causal cell-type-specific eQTLs using machine learning trained on 4,839 genomic features, including:
Deep learning-based variant effect predictions
Cell-type-specific regulatory annotations
Activity-by-Contact (ABC) enhancer-gene linkages
Distance and evolutionary constraint features
By identifying functional variants in cell-type-specific contexts—particularly in enhancer regions—scEEMS aims to bridge the gap between non-coding GWAS variants and their target genes, improving our understanding of disease mechanisms in Alzheimer’s disease.
Tutorial Objective#
This notebook provides a reproducible pipeline for training scEEMS models, demonstrating the complete methodology from data preparation through model evaluation. The goal is to enable the broader scientific community to apply this approach to their own cell-type-specific eQTL datasets and disease contexts.
Methods Overview#
CatBoost Algorithm#
We use CatBoost, a gradient boosting framework that builds an ensemble of decision trees sequentially. CatBoost is effective for high-dimensional biological datasets with mixed data types.
Model Training Strategy#
We train a CatBoost model with 10x upweighting of deep learning features (feature weight = 10 for DL-VEP features vs. 1 for other features). This model was selected as optimal based on external validation and heritability analysis described in the manuscript.
Training Data Construction#
Data Source: Fine-mapped single-cell eQTLs from six brain cell types in the ROSMAP cohort.
Positive Class (Y=1):
Variants with PIP > 0.05 in a credible set where the maximum PIP exceeds 0.1, OR
Variants with PIP > 0.5 regardless of credible set membership
Negative Class (Y=0):
For each positive variant, we sample 10 negative variants from the same gene with PIP < 0.01, matched on variant type (SNP, insertion, deletion)
Test Set:
Positive variants: PIP > 0.90
Negative variants: 10 matched variants per positive variant with PIP < 0.01
Restricted to MEGA genes only
Sample Weighting#
Negative variants: weight = 1
Positive variants: weighted proportional to their PIP values
Total weight balanced between positive and negative classes
Cross-Validation: Leave-One-Chromosome-Out (LOCO)#
For each of the 22 autosomes:
Train on variants from all other 21 chromosomes
Test on the held-out chromosome
Aggregate predictions from all 22 held-out chromosomes for final performance metrics
Toy Dataset Note#
This tutorial uses chromosome 2 data only for demonstration:
Training: 3,056 variants
Testing: 761 variants (non-overlapping)
The full study trained models across all 22 chromosomes for each of 6 cell types
Input Data: Feature Categories#
Based on the manuscript (Figure 1, Table C), scEEMS uses 4,839 features across these categories:
# Load gene constraint data
import pandas as pd
gene_data = pd.read_excel('data/41588_2024_1820_MOESM4_ESM.xlsx')
print(gene_data.head(3))
Sample output showing GeneBayes constraint scores:
ensg hgnc chrom obs_lof exp_lof post_mean post_lower_95 post_upper_95
ENSG00000198488 HGNC:24141 chr11 12 8.9777 6.46629E-05 7.16256E-06 0.00017805
ENSG00000164363 HGNC:26441 chr5 31 28.55 0.00016062 2.59918E-05 0.00044175
ENSG00000159337 HGNC:30038 chr15 28 41.84 0.00016978 0.000018674 0.00053317
Feature Categories#
1. Distance Features#
Biological Rationale: Physical proximity determines regulatory potential
abs_distance_TSS_log: Log-transformed distance to transcription start site
2. Cell-Type Regulatory Features#
Biological Rationale: Functional genomics assays corresponding to cell-specific regulation
ABC_score_microglia: Activity-by-Contact regulatory activity score
3. Population Genetics Features#
Biological Rationale: Minor allele frequency can impact regulatory activity in cells
gnomad_MAF: Minor allele frequency from population database
4. Conservation Features#
Biological Rationale: Evolutionary constraint of the gene can may impact whether their exists an eQTL
GeneBayes Constraint Score: Bayesian score of gene constraint
5. Deep Learning Predictions#
Biological Rationale: Sequence-to-function model predictions can determine disruption of intermediate molecular phenotypes.
Enformer, ChromBPNet, and BPNet predictions of various functional genomics assays (excluding gene expression predictions)
diff_32_ENCFF140MBA: Z-score normalized Enformer VEP score for H3K4me1:CD8-positive, alpha-beta T cell (using middle 32 bins corresponding 4096 bp flanking the variant).
Output Data Generated#
Output Type |
Description |
Research Use |
|---|---|---|
Trained models |
Single feature-weighted CatBoost classifier |
Variant effect prediction |
Performance metrics |
AP/AUC scores |
Model comparison |
Feature importance |
Genomic feature rankings |
Biological interpretation |
Training Workflow#
Step 1: Running the GEMS Pipeline#
cd ~/xqtl-protocol/code/xqtl_modifier_score/
python gems_pipeline.py train Mic_mega_eQTL 2 \
--data_config data_config.yaml \
--model_config model_config.yaml
The pipeline automatically loads training data, trains the feature-weighted CatBoost model, and generates predictions.
# Configuration file check
import os
required_files = [
'gems_pipeline.py',
'data_config.yaml',
'model_config.yaml'
]
print("Configuration File Check:")
print("=" * 50)
for file in required_files:
status = "✅ Found" if os.path.exists(file) else "❌ Missing"
print(f" {file}: {status}")
Step 2: Results & Performance Analysis#
# Performance results
print("Model Performance Results:")
print("=" * 50)
# Single model results (corrected from previous 4-model approach)
corrected_results = {
'Feature-Weighted Model (Model 5)': {'AP': 0.5050, 'AUC': 0.8978}
}
for model, metrics in corrected_results.items():
print(f"{model}:")
print(f" Average Precision: {metrics['AP']:.4f}")
print(f" AUC Score: {metrics['AUC']:.4f}")
print()
print("Feature Importance (Top 10):")
print("-" * 40)
top_features = [
("abs_distance_TSS_log", 23.58),
("diff_32_ENCFF140MBA", 1.83),
("diff_32_ENCFF457MJJ", 1.12),
("gnomad_MAF", 0.88),
("diff_32_ENCFF455KGB", 0.84),
("diff_32_ENCFF579IKK", 0.80),
("ABC_score_microglia", 0.54),
("microglia_enhancer_promoter_union_atac", 0.58),
("diff_32_ENCFF098QMC", 0.66),
("diff_32_ENCFF757EYT", 0.66)
]
for i, (feature, importance) in enumerate(top_features, 1):
print(f"{i:2d}. {feature:<35} {importance:>6.2f}")
Generated Output Files#
Model File:
model_feature_weighted_chr_chr2_NPR_10.joblib- Trained CatBoost classifier
Analysis Results:
summary_dict_catboost_1model_chr_chr2_NPR_10.pkl- Performance metrics and validation statisticsfeatures_importance_1model_chr_chr2_NPR_10.csv- Complete feature importance rankingspredictions_1model_chr2.tsv- Per-variant prediction probabilities
GEMS Pipeline Modularity#
GEMS: Generalized Expression Modifier Scores#
The pipeline uses gems_pipeline.py (Generalized Expression Modifier Scores), extending beyond single-cell data to work with any tissue or cell type.
Design Principles:
New datasets added by only modifying YAML configuration files
No Python code changes required for different cell types
Subcommand architecture (train/predict) for extensible workflows
Command Structure#
# Training mode
python gems_pipeline.py train <cohort> <chromosome> \
--data_config data_config.yaml \
--model_config model_config.yaml
# Prediction mode (coming soon)
python gems_pipeline.py predict <cohort> <chromosome> \
--model_path results/model.cbm \
--data_config data_config.yaml
Testing Modularity with Different Cell Types#
Test Case:
Original development: Microglia (Mic_mega_eQTL)
Modularity validation: Astrocytes (Ast_mega_eQTL)
# Train on Microglia
python gems_pipeline.py train Mic_mega_eQTL 2 \
--data_config data_config.yaml \
--model_config model_config.yaml
# Train on Astrocytes (same code, different YAML configuration)
python gems_pipeline.py train Ast_mega_eQTL 1 \
--data_config data_config.yaml \
--model_config model_config.yaml
Validation Results:
✅ Automatic path resolution to Astrocyte data
✅ All 4,839 genomic features loaded without code changes
✅ Data preprocessing and imputation applied correctly
✅ Proper train/test split maintained
This demonstrates true modularity—new cell types can be analyzed using only YAML configuration changes.
Using Your Trained Models#
Once training is complete, load the trained models for predictions. Please refer to EMS Predictions for detailed prediction workflows and variant scoring.