scEEMS Prediction#
Description#
Pipeline Execution Workflow#
Step 1: Input Data Preparation#
Create a tab-separated file with variant identifiers:
variant_id
2:12345:A:T
2:67890:G:C
2:11111:T:A
Format requirements:
Chromosome notation: Numeric (e.g.,
2) or with prefix (e.g.,chr2)Position: 1-based genomic coordinate (GRCh38/hg38 reference)
Alleles: Reference and alternate alleles (ACGT notation)
Step 2: Execute Prediction Pipeline#
cd ~/xqtl-protocol/code/xqtl_modifier_score/
python model_training_model5_only.py Mic_mega_eQTL 2 \
--data_config data_config.yaml \
--model_config model_config.yaml
What happens during execution:
Variant parsing and coordinate validation
Feature annotation (distance, regulatory, population genetics, conservation, deep learning predictions)
Gene constraint and MAF integration with imputation using training statistics
Feature subsetting and absolute value transformations
Model inference with 10x weighting for regulatory features
Probability calibration and output generation
Step 3: Output Files Generated#
File |
Description |
Use Case |
|---|---|---|
|
Serialized CatBoost model |
Future predictions, model inspection |
|
Per-variant predictions with scores |
Primary analysis, variant prioritization |
|
AP/AUC metrics, class distributions |
Model validation, quality control |
|
Feature importance rankings |
Biological interpretation |
Load and Inspect Results#
import pandas as pd
import joblib
import pickle
import numpy as np
# Load trained CatBoost model
MODEL_PATH = "../../data/Mic_mega_eQTL/model_results/model_standard_subset_weighted_chr_chr2_NPR_10.joblib"
model = joblib.load(MODEL_PATH)
print("Model Configuration:")
print(f" Algorithm: {model.__class__.__name__}")
print(f" Features: {model.feature_count_}")
print(f" Classes: {list(model.classes_)}")
print(f" Tree depth: {model.get_params()['depth']}")
print(f" Iterations: {model.tree_count_}")
# Load prediction results
RESULTS_PATH = "../../data/Mic_mega_eQTL/model_results/predictions_parquet_catboost/predictions_weighted_model_chr2.tsv"
results = pd.read_csv(RESULTS_PATH, sep='\t')
print(f"\nLoaded predictions for {len(results)} variants")
print(f"\nAvailable columns:")
print(f" - variant_id: Genomic coordinates")
print(f" - standard_subset_weighted_pred_prob: EMS functional score (0-1)")
print(f" - standard_subset_weighted_pred_label: Binary classification (0/1)")
print(f" - actual_label: True label from test set (if available)")
Example output:
Model Configuration:
Algorithm: CatBoostClassifier
Features: 4839
Classes: [0, 1]
Tree depth: 6
Iterations: 1000
Loaded predictions for 761 variants
Available columns:
- variant_id: Genomic coordinates
- standard_subset_weighted_pred_prob: EMS functional score (0-1)
- standard_subset_weighted_pred_label: Binary classification (0/1)
- actual_label: True label from test set (if available)
Statistical Summary of Predictions#
The standard_subset_weighted_pred_prob column contains continuous EMS scores representing the probability that each variant has functional regulatory impact on gene expression.
score_col = 'standard_subset_weighted_pred_prob'
# Quantitative distribution analysis
print("SCORE DISTRIBUTION ANALYSIS")
print("=" * 50)
print(f"Total variants analyzed: {len(results)}")
print(f"\nPriority Classification:")
high = len(results[results[score_col] > 0.8])
medium = len(results[(results[score_col] > 0.5) & (results[score_col] <= 0.8)])
low = len(results[results[score_col] <= 0.5])
print(f" High priority (>0.8): {high:5d} ({high/len(results)*100:5.1f}%)")
print(f" Medium priority (0.5-0.8): {medium:5d} ({medium/len(results)*100:5.1f}%)")
print(f" Low priority (<0.5): {low:5d} ({low/len(results)*100:5.1f}%)")
# Descriptive statistics
print(f"\nScore Statistics:")
print(f" Mean: {results[score_col].mean():.4f}")
print(f" Median: {results[score_col].median():.4f}")
print(f" Std: {results[score_col].std():.4f}")
print(f" Min: {results[score_col].min():.4f}")
print(f" Max: {results[score_col].max():.4f}")
# Percentile distribution
print(f"\nPercentile Distribution:")
for p in [90, 95, 99]:
val = np.percentile(results[score_col], p)
count = len(results[results[score_col] >= val])
print(f" {p}th percentile: {val:.4f} ({count} variants)")
Example output:
SCORE DISTRIBUTION ANALYSIS
==================================================
Total variants analyzed: 761
Priority Classification:
High priority (>0.8): 12 ( 1.6%)
Medium priority (0.5-0.8): 45 ( 5.9%)
Low priority (<0.5): 704 ( 92.5%)
Score Statistics:
Mean: 0.1245
Median: 0.0823
Std: 0.1567
Min: 0.0012
Max: 0.9234
Percentile Distribution:
90th percentile: 0.3421 (76 variants)
95th percentile: 0.5012 (38 variants)
99th percentile: 0.7834 (8 variants)
Variant Prioritization and Export#
Extract and rank high-confidence functional predictions for downstream experimental validation.
# Identify and sort high-priority variants
high_priority = results[results[score_col] > 0.8].sort_values(score_col, ascending=False)
print(f"HIGH-PRIORITY VARIANTS (Score >0.8)")
print("=" * 50)
print(f"Total: {len(high_priority)} variants")
if len(high_priority) > 0:
print(f"\nTop 10 by EMS score:")
display_cols = ['variant_id', score_col]
if 'actual_label' in results.columns:
display_cols.append('actual_label')
print(high_priority[display_cols].head(10).to_string(index=False))
# Export for experimental design
output_file = "high_priority_variants_validation.tsv"
high_priority.to_csv(output_file, sep='\t', index=False)
print(f"\nExported to: {output_file}")
print(f" Contains all {len(high_priority)} high-priority variants")
print(f" Ready for: CRISPR screens, luciferase assays, functional validation")
else:
print("\nNo variants exceed 0.8 threshold")
print(" Recommended actions:")
print(" 1. Review medium-priority variants (0.5-0.8)")
print(" 2. Consider lowering threshold based on experimental capacity")
print(f" 3. Top variant score: {results[score_col].max():.4f}")
Example output:
HIGH-PRIORITY VARIANTS (Score >0.8)
==================================================
Total: 12 variants
Top 10 by EMS score:
variant_id standard_subset_weighted_pred_prob actual_label
2:54321:A:G 0.9234 1
2:12345:T:C 0.9102 1
2:98765:C:A 0.8956 1
2:44444:G:T 0.8723 0
2:77777:A:C 0.8612 1
2:33333:T:G 0.8501 1
2:66666:C:T 0.8398 1
2:11111:G:A 0.8267 0
2:55555:A:T 0.8145 1
2:99999:T:A 0.8034 1
Exported to: high_priority_variants_validation.tsv
Contains all 12 high-priority variants
Ready for: CRISPR screens, luciferase assays, functional validation
Verify Model Performance#
Review metrics from the held-out test set to understand model reliability.
# Load and display performance metrics
SUMMARY_PATH = "../../data/Mic_mega_eQTL/model_results/summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl"
with open(SUMMARY_PATH, 'rb') as f:
summary = pickle.load(f)
print("MODEL PERFORMANCE ON TEST SET")
print("=" * 50)
metrics = summary['CatBoost']['standard_subset_weighted']
print(f"Average Precision (AP): {metrics['AP_test']:.4f}")
print(f"AUC-ROC: {metrics['AUC_test']:.4f}")
print(f"\nTest Set Composition:")
print(f" Positive labels (functional eQTLs): {summary['CatBoost']['test_num_positive_labels']}")
print(f" Negative labels (non-functional): {summary['CatBoost']['test_num_negative_labels']}")
pos_rate = summary['CatBoost']['test_num_positive_labels'] / (summary['CatBoost']['test_num_positive_labels'] + summary['CatBoost']['test_num_negative_labels'])
print(f" Positive rate: {pos_rate:.1%}")
print("\n These metrics reflect performance on 761 held-out chromosome 2 variants")
print(" with stricter selection criteria (PIP >0.9 for positives) than training data.")
Example output:
MODEL PERFORMANCE ON TEST SET
==================================================
Average Precision (AP): 0.5050
AUC-ROC: 0.8978
Test Set Composition:
Positive labels (functional eQTLs): 68
Negative labels (non-functional): 693
Positive rate: 8.9%
These metrics reflect performance on 761 held-out chromosome 2 variants
with stricter selection criteria (PIP >0.9 for positives) than training data.
Application to New Variant Lists#
Workflow for Novel Variants#
To score additional variants not included in the original training/test sets:
1. Prepare input file matching the format above (chr:pos:ref:alt)
2. Update configuration (data_config.yaml):
Point
training_data.base_dirto directory containing your variant annotationsEnsure gene constraint file (GeneBayes scores) is accessible
Verify MAF file matches your chromosome
3. Execute pipeline (same command, different input data):
python model_training_model5_only.py [cohort] [chromosome] \
--data_config data_config.yaml \
--model_config model_config.yaml
The pipeline will:
Generate all 4,839 genomic features for your variants
Apply the same preprocessing (subsetting, absolute values, imputation)
Use the trained model for inference
Output predictions in identical format
4. Analyze results using the code blocks above
Steps#
Step 1. Score variants with a trained feature-weighted CatBoost scEEMS model for one cohort / chromosome. The toy command below scores the microglia cohort (protocol_example) on chromosome 2 using a model produced by the training workflow.
Timing: ~varies on typical compute infrastructure.
sos run pipeline/ems_prediction.ipynb predict \
--cohort protocol_example \
--chromosome 2 \
--model_path output/ems_training/protocol_example_chr2_scEEMS_model.joblib \
--data_config code/xqtl_modifier_score/data_config.yaml \
--cwd output/ems_prediction
Command interface#
sos run pipeline/ems_prediction.ipynb -h
Workflow implementation#
The predict step wraps the gems_pipeline.py predict engine. The pipeline loads the trained model, annotates the input variants with the full feature set, applies the 10x deep-learning feature weighting used at training time, and writes per-variant EMS scores. New datasets / cell types are scored by editing the data_config YAML only - no Python code changes are required.
[global]
import os
# Work directory & output directory
parameter: cwd = path('output/ems_prediction')
# Cohort / cell type to score (must match an entry in the data config)
parameter: cohort = 'protocol_example'
# Chromosome to score
parameter: chromosome = '2'
# Trained CatBoost model (.joblib) from the EMS Training workflow
parameter: model_path = path('output/ems_training/protocol_example_chr2_scEEMS_model.joblib')
# Data configuration YAML (cohort, variant list, feature set)
parameter: data_config = path('code/xqtl_modifier_score/data_config.yaml')
# Directory holding gems_pipeline.py
parameter: pipeline_dir = path('code/xqtl_modifier_score')
parameter: job_size = 1
parameter: mem = '60G'
parameter: walltime = '24h'
[predict]
output: f'{pipeline_dir:a}/' + __import__('yaml').safe_load(open(data_config))['output']['base_dir'].format(cohort=cohort) + f"/{__import__('yaml').safe_load(open(data_config))['output']['predictions_dir']}/predictions_weighted_model_chr{chromosome}.tsv"
task: trunk_workers = 1, trunk_size = job_size, mem = mem, walltime = walltime, tags = f'{step_name}_{cohort}_chr{chromosome}'
bash: expand = "$[ ]", workdir = pipeline_dir, stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
python gems_pipeline.py predict $[cohort] $[chromosome] \
--model_path $[model_path:a] \
--data_config $[data_config:a]
Anticipated Results#
Running the predict step produces per-variant scEEMS prediction scores for the supplied variant list. See the Pipeline Execution Workflow walkthrough above for the statistical summary, variant prioritization, and model-performance verification of these outputs.