EMS Prediction#
Author: Angelina Siagailo (angiesiagailo@gmail.com), with input from Gao Wang and Katie Cho.
Apply the trained sc-EMS model to score your genetic variants for functional impact. This notebook demonstrates prediction workflows using the model trained in the EMS Training Tutorial.
What This Does#
Input: Your variant list in VCF-style format (chr:pos:ref:alt)
Output: Variants annotated with EMS functional probability scores (continuous range: 0-1)
Score >0.8: High functional probability - prioritize for experimental validation (CRISPR screens, luciferase assays)
Score 0.5-0.8: Moderate functional probability - context-dependent prioritization
Score ≤0.5: Low functional probability
Purpose: Prioritize variants likely to affect gene expression using the trained feature-weighted CatBoost model
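As a minimal illustration of these cutoffs, the tiers can be expressed directly in Python (a hypothetical helper, not part of the pipeline):

def priority_tier(score):
    """Map a continuous EMS score (0-1) to the priority tiers described above."""
    if score > 0.8:
        return "high"    # prioritize for experimental validation
    elif score > 0.5:
        return "medium"  # context-dependent prioritization
    return "low"

print(priority_tier(0.92))  # -> high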
Pipeline Execution Workflow#
Step 1: Input Data Preparation#
Create a tab-separated file with variant identifiers:
variant_id
2:12345:A:T
2:67890:G:C
2:11111:T:A
Format requirements:
Chromosome notation: Numeric (e.g., 2) or with prefix (e.g., chr2)
Position: 1-based genomic coordinate (GRCh38/hg38 reference)
Alleles: Reference and alternate alleles (ACGT notation)
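For example, the input file can be written with pandas (a sketch; the file name my_variants.tsv is illustrative):

import pandas as pd

# Three hypothetical chromosome 2 variants in chr:pos:ref:alt format (GRCh38)
variants = pd.DataFrame({"variant_id": ["2:12345:A:T", "2:67890:G:C", "2:11111:T:A"]})
variants.to_csv("my_variants.tsv", sep="\t", index=False)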
Step 2: Execute Prediction Pipeline#
cd ~/xqtl-protocol/code/xqtl_modifier_score/
python model_training_model5_only.py Mic_mega_eQTL 2 \
--data_config data_config.yaml \
--model_config model_config.yaml
What happens during execution:
Variant parsing and coordinate validation
Feature annotation (distance, regulatory, population genetics, conservation, deep learning predictions)
Gene constraint and MAF integration with imputation using training statistics
Feature subsetting and absolute value transformations
Model inference with 10x weighting for regulatory features
Probability calibration and output generation
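The parsing and validation step can be sketched as follows; the pipeline's internal implementation may differ, and parse_variant_id is a hypothetical name:

import re

def parse_variant_id(variant_id):
    """Split a chr:pos:ref:alt identifier; tolerate an optional chr prefix."""
    chrom, pos, ref, alt = variant_id.split(":")
    chrom = chrom.removeprefix("chr")  # normalize 'chr2' -> '2' (Python 3.9+)
    if not (re.fullmatch(r"[ACGT]+", ref) and re.fullmatch(r"[ACGT]+", alt)):
        raise ValueError(f"Alleles must use ACGT notation: {variant_id}")
    return chrom, int(pos), ref, alt

print(parse_variant_id("chr2:12345:A:T"))  # ('2', 12345, 'A', 'T')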
Step 3: Output Files Generated#
| File | Description | Use Case |
|---|---|---|
| model_standard_subset_weighted_chr_chr2_NPR_10.joblib | Serialized CatBoost model | Future predictions, model inspection |
| predictions_weighted_model_chr2.tsv | Per-variant predictions with scores | Primary analysis, variant prioritization |
| summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl | AP/AUC metrics, class distributions | Model validation, quality control |
| feature importance file | Feature importance rankings | Biological interpretation |

File names correspond to the chromosome 2 Mic_mega_eQTL run used throughout this notebook.
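To confirm a run produced these files, a quick listing of the results directory helps (a sketch using the same path as the loading code below):

from pathlib import Path

# Results directory used throughout this notebook
results_dir = Path("../../data/Mic_mega_eQTL/model_results")
for path in sorted(results_dir.rglob("*chr2*")):
    print(path.relative_to(results_dir))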
Load and Inspect Results#
import pandas as pd
import joblib
import pickle
import numpy as np
# Load trained CatBoost model
MODEL_PATH = "../../data/Mic_mega_eQTL/model_results/model_standard_subset_weighted_chr_chr2_NPR_10.joblib"
model = joblib.load(MODEL_PATH)
print("Model Configuration:")
print(f" Algorithm: {model.__class__.__name__}")
print(f" Features: {model.feature_count_}")
print(f" Classes: {list(model.classes_)}")
print(f" Tree depth: {model.get_params()['depth']}")
print(f" Iterations: {model.tree_count_}")
# Load prediction results
RESULTS_PATH = "../../data/Mic_mega_eQTL/model_results/predictions_parquet_catboost/predictions_weighted_model_chr2.tsv"
results = pd.read_csv(RESULTS_PATH, sep='\t')
print(f"\nLoaded predictions for {len(results)} variants")
print(f"\nAvailable columns:")
print(f" - variant_id: Genomic coordinates")
print(f" - standard_subset_weighted_pred_prob: EMS functional score (0-1)")
print(f" - standard_subset_weighted_pred_label: Binary classification (0/1)")
print(f" - actual_label: True label from test set (if available)")
Example output:
Model Configuration:
Algorithm: CatBoostClassifier
Features: 4839
Classes: [0, 1]
Tree depth: 6
Iterations: 1000
Loaded predictions for 761 variants
Available columns:
- variant_id: Genomic coordinates
- standard_subset_weighted_pred_prob: EMS functional score (0-1)
- standard_subset_weighted_pred_label: Binary classification (0/1)
- actual_label: True label from test set (if available)
Statistical Summary of Predictions#
The standard_subset_weighted_pred_prob column contains continuous EMS scores representing the probability that each variant has functional regulatory impact on gene expression.
score_col = 'standard_subset_weighted_pred_prob'
# Quantitative distribution analysis
print("SCORE DISTRIBUTION ANALYSIS")
print("=" * 50)
print(f"Total variants analyzed: {len(results)}")
print(f"\nPriority Classification:")
high = len(results[results[score_col] > 0.8])
medium = len(results[(results[score_col] > 0.5) & (results[score_col] <= 0.8)])
low = len(results[results[score_col] <= 0.5])
print(f" High priority (>0.8): {high:5d} ({high/len(results)*100:5.1f}%)")
print(f" Medium priority (0.5-0.8): {medium:5d} ({medium/len(results)*100:5.1f}%)")
print(f" Low priority (<0.5): {low:5d} ({low/len(results)*100:5.1f}%)")
# Descriptive statistics
print(f"\nScore Statistics:")
print(f" Mean: {results[score_col].mean():.4f}")
print(f" Median: {results[score_col].median():.4f}")
print(f" Std: {results[score_col].std():.4f}")
print(f" Min: {results[score_col].min():.4f}")
print(f" Max: {results[score_col].max():.4f}")
# Percentile distribution
print(f"\nPercentile Distribution:")
for p in [90, 95, 99]:
    val = np.percentile(results[score_col], p)
    count = len(results[results[score_col] >= val])
    print(f" {p}th percentile: {val:.4f} ({count} variants)")
Example output:
SCORE DISTRIBUTION ANALYSIS
==================================================
Total variants analyzed: 761
Priority Classification:
High priority (>0.8): 12 ( 1.6%)
Medium priority (0.5-0.8): 45 ( 5.9%)
Low priority (<=0.5): 704 ( 92.5%)
Score Statistics:
Mean: 0.1245
Median: 0.0823
Std: 0.1567
Min: 0.0012
Max: 0.9234
Percentile Distribution:
90th percentile: 0.3421 (76 variants)
95th percentile: 0.5012 (38 variants)
99th percentile: 0.7834 (8 variants)
Variant Prioritization and Export#
Extract and rank high-confidence functional predictions for downstream experimental validation.
# Identify and sort high-priority variants
high_priority = results[results[score_col] > 0.8].sort_values(score_col, ascending=False)
print(f"HIGH-PRIORITY VARIANTS (Score >0.8)")
print("=" * 50)
print(f"Total: {len(high_priority)} variants")
if len(high_priority) > 0:
    print(f"\nTop 10 by EMS score:")
    display_cols = ['variant_id', score_col]
    if 'actual_label' in results.columns:
        display_cols.append('actual_label')
    print(high_priority[display_cols].head(10).to_string(index=False))

    # Export for experimental design
    output_file = "high_priority_variants_validation.tsv"
    high_priority.to_csv(output_file, sep='\t', index=False)
    print(f"\nExported to: {output_file}")
    print(f" Contains all {len(high_priority)} high-priority variants")
    print(f" Ready for: CRISPR screens, luciferase assays, functional validation")
else:
    print("\nNo variants exceed 0.8 threshold")
    print(" Recommended actions:")
    print(" 1. Review medium-priority variants (0.5-0.8)")
    print(" 2. Consider lowering threshold based on experimental capacity")
    print(f" 3. Top variant score: {results[score_col].max():.4f}")
Example output:
HIGH-PRIORITY VARIANTS (Score >0.8)
==================================================
Total: 12 variants
Top 10 by EMS score:
variant_id standard_subset_weighted_pred_prob actual_label
2:54321:A:G 0.9234 1
2:12345:T:C 0.9102 1
2:98765:C:A 0.8956 1
2:44444:G:T 0.8723 0
2:77777:A:C 0.8612 1
2:33333:T:G 0.8501 1
2:66666:C:T 0.8398 1
2:11111:G:A 0.8267 0
2:55555:A:T 0.8145 1
2:99999:T:A 0.8034 1
Exported to: high_priority_variants_validation.tsv
Contains all 12 high-priority variants
Ready for: CRISPR screens, luciferase assays, functional validation
Verify Model Performance#
Review metrics from the held-out test set to understand model reliability.
# Load and display performance metrics
SUMMARY_PATH = "../../data/Mic_mega_eQTL/model_results/summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl"
with open(SUMMARY_PATH, 'rb') as f:
    summary = pickle.load(f)
print("MODEL PERFORMANCE ON TEST SET")
print("=" * 50)
metrics = summary['CatBoost']['standard_subset_weighted']
print(f"Average Precision (AP): {metrics['AP_test']:.4f}")
print(f"AUC-ROC: {metrics['AUC_test']:.4f}")
print(f"\nTest Set Composition:")
print(f" Positive labels (functional eQTLs): {summary['CatBoost']['test_num_positive_labels']}")
print(f" Negative labels (non-functional): {summary['CatBoost']['test_num_negative_labels']}")
pos_rate = summary['CatBoost']['test_num_positive_labels'] / (summary['CatBoost']['test_num_positive_labels'] + summary['CatBoost']['test_num_negative_labels'])
print(f" Positive rate: {pos_rate:.1%}")
print("\n These metrics reflect performance on 761 held-out chromosome 2 variants")
print(" with stricter selection criteria (PIP >0.9 for positives) than training data.")
Example output:
MODEL PERFORMANCE ON TEST SET
==================================================
Average Precision (AP): 0.5050
AUC-ROC: 0.8978
Test Set Composition:
Positive labels (functional eQTLs): 68
Negative labels (non-functional): 693
Positive rate: 8.9%
These metrics reflect performance on 761 held-out chromosome 2 variants
with stricter selection criteria (PIP >0.9 for positives) than training data.
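As a cross-check, when actual_label is present these metrics can be recomputed from the predictions file with scikit-learn (a sketch, reusing results and score_col from the earlier cells):

from sklearn.metrics import average_precision_score, roc_auc_score

if 'actual_label' in results.columns:
    ap = average_precision_score(results['actual_label'], results[score_col])
    auc = roc_auc_score(results['actual_label'], results[score_col])
    print(f"Recomputed AP: {ap:.4f}  AUC-ROC: {auc:.4f}")  # should match the summary pickle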
Application to New Variant Lists#
Workflow for Novel Variants#
To score additional variants not included in the original training/test sets:
1. Prepare input file matching the format above (chr:pos:ref:alt)
2. Update configuration (data_config.yaml):
Point training_data.base_dir to the directory containing your variant annotations
Ensure the gene constraint file (GeneBayes scores) is accessible
Verify the MAF file matches your chromosome
3. Execute pipeline (same command, different input data):
python model_training_model5_only.py [cohort] [chromosome] \
--data_config data_config.yaml \
--model_config model_config.yaml
The pipeline will:
Generate all 4,839 genomic features for your variants
Apply the same preprocessing (subsetting, absolute values, imputation)
Use the trained model for inference
Output predictions in identical format
4. Analyze results using the code blocks above
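If you already have an annotated feature matrix for new variants, the trained model loaded earlier can also score them directly, without re-running the full pipeline. This is a sketch, assuming new_features is a pandas DataFrame containing the same 4,839 features produced by the annotation and preprocessing steps:

# Direct inference with the loaded CatBoost model (model from the cells above)
new_features = new_features[model.feature_names_]  # align columns to the model's feature order
scores = model.predict_proba(new_features)[:, 1]   # probability of the functional class (1)
print(pd.Series(scores).describe())                # quick distribution summary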