EMS Prediction#

Author: Angelina Siagailo (angiesiagailo@gmail.com) with input from Gao Wang and Katie Cho.

Apply the trained sc-EMS model to score your own genetic variants for functional impact. This notebook demonstrates prediction workflows using the model trained in the EMS Training Tutorial.

What This Does#

  • Input: Your variant list as VCF-style identifiers (chr:pos:ref:alt)

  • Output: Variants annotated with EMS functional probability scores (continuous range: 0-1)

    • Score >0.8: High functional probability - prioritize for experimental validation (CRISPR screens, luciferase assays)

    • Score 0.5-0.8: Moderate functional probability - context-dependent prioritization

    • Score ≤0.5: Low functional probability

  • Purpose: Prioritize variants likely to affect gene expression using the trained feature-weighted CatBoost model

Pipeline Execution Workflow#

Step 1: Input Data Preparation#

Create a tab-separated file with variant identifiers:

variant_id
2:12345:A:T
2:67890:G:C
2:11111:T:A

Format requirements (a validation sketch follows this list):

  • Chromosome notation: Numeric (e.g., 2) or with prefix (e.g., chr2)

  • Position: 1-based genomic coordinate (GRCh38/hg38 reference)

  • Alleles: Reference and alternate alleles (ACGT notation)
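Before running the pipeline, it is worth checking the file programmatically. A minimal sketch, assuming your list is saved as my_variants.tsv (the file name and regex are illustrative, not pipeline requirements):

import re
import pandas as pd

# Optional "chr" prefix, numeric/X/Y/MT chromosome, 1-based position, ACGT alleles
VARIANT_PATTERN = r"^(chr)?(\d+|X|Y|MT?):\d+:[ACGT]+:[ACGT]+$"

variants = pd.read_csv("my_variants.tsv", sep="\t")
bad = variants[~variants["variant_id"].str.match(VARIANT_PATTERN)]
if len(bad) > 0:
    print(f"{len(bad)} malformed variant IDs, for example:")
    print(bad["variant_id"].head().to_string(index=False))
else:
    print(f"All {len(variants)} variant IDs look well-formed")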

Step 2: Execute Prediction Pipeline#

cd ~/xqtl-protocol/code/xqtl_modifier_score/
python model_training_model5_only.py Mic_mega_eQTL 2 \
  --data_config data_config.yaml \
  --model_config model_config.yaml

What happens during execution:

  1. Variant parsing and coordinate validation

  2. Feature annotation (distance, regulatory, population genetics, conservation, deep learning predictions)

  3. Gene constraint and MAF integration with imputation using training statistics

  4. Feature subsetting and absolute value transformations

  5. Model inference with 10x weighting for regulatory features

  6. Probability calibration and output generation
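The command above scores a single cohort and chromosome. To cover more chromosomes, one option is a simple driver loop; a minimal sketch, assuming the script accepts the positional cohort and chromosome arguments exactly as shown above:

import subprocess

COHORT = "Mic_mega_eQTL"
for chrom in range(1, 23):  # autosomes; restrict to the chromosomes you need
    subprocess.run(
        [
            "python", "model_training_model5_only.py", COHORT, str(chrom),
            "--data_config", "data_config.yaml",
            "--model_config", "model_config.yaml",
        ],
        check=True,  # raise immediately if any chromosome fails
    )

Note that, per the configuration notes in the final section, the MAF file referenced in data_config.yaml must match each chromosome being scored.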

Step 3: Output Files Generated#

| File | Description | Use Case |
|------|-------------|----------|
| model_standard_subset_weighted_chr_chr2_NPR_10.joblib | Serialized CatBoost model | Future predictions, model inspection |
| predictions_weighted_model_chr2.tsv | Per-variant predictions with scores | Primary analysis, variant prioritization |
| summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl | AP/AUC metrics, class distributions | Model validation, quality control |
| features_importance_model5_chr_chr2_NPR_10.csv | Feature importance rankings | Biological interpretation |
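A quick way to confirm a run completed is to check that these artifacts exist on disk. A minimal sketch (RESULTS_DIR mirrors the paths used in the loading code below; the location of the feature importance CSV within model_results is an assumption):

from pathlib import Path

RESULTS_DIR = Path("../../data/Mic_mega_eQTL/model_results")
expected = [
    RESULTS_DIR / "model_standard_subset_weighted_chr_chr2_NPR_10.joblib",
    RESULTS_DIR / "predictions_parquet_catboost" / "predictions_weighted_model_chr2.tsv",
    RESULTS_DIR / "summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl",
    RESULTS_DIR / "features_importance_model5_chr_chr2_NPR_10.csv",
]
for path in expected:
    print(f"{'OK     ' if path.exists() else 'MISSING'}  {path.name}")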

Load and Inspect Results#

import pandas as pd
import joblib
import pickle
import numpy as np

# Load trained CatBoost model
MODEL_PATH = "../../data/Mic_mega_eQTL/model_results/model_standard_subset_weighted_chr_chr2_NPR_10.joblib"
model = joblib.load(MODEL_PATH)

print("Model Configuration:")
print(f"  Algorithm: {model.__class__.__name__}")
print(f"  Features: {model.feature_count_}")
print(f"  Classes: {list(model.classes_)}")
print(f"  Tree depth: {model.get_params()['depth']}")
print(f"  Iterations: {model.tree_count_}")

# Load prediction results
RESULTS_PATH = "../../data/Mic_mega_eQTL/model_results/predictions_parquet_catboost/predictions_weighted_model_chr2.tsv"
results = pd.read_csv(RESULTS_PATH, sep='\t')

print(f"\nLoaded predictions for {len(results)} variants")
print(f"\nAvailable columns:")
print(f"  - variant_id: Genomic coordinates")
print(f"  - standard_subset_weighted_pred_prob: EMS functional score (0-1)")
print(f"  - standard_subset_weighted_pred_label: Binary classification (0/1)")
print(f"  - actual_label: True label from test set (if available)")

Example output:

Model Configuration:
  Algorithm: CatBoostClassifier
  Features: 4839
  Classes: [0, 1]
  Tree depth: 6
  Iterations: 1000

Loaded predictions for 761 variants

Available columns:
  - variant_id: Genomic coordinates
  - standard_subset_weighted_pred_prob: EMS functional score (0-1)
  - standard_subset_weighted_pred_label: Binary classification (0/1)
  - actual_label: True label from test set (if available)
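The feature importance rankings listed in the output table can be inspected the same way; a short sketch (the CSV's directory is assumed, and its column names depend on the pipeline, so we only preview the head):

import pandas as pd

# Load feature importance rankings for biological interpretation
FI_PATH = "../../data/Mic_mega_eQTL/model_results/features_importance_model5_chr_chr2_NPR_10.csv"
fi = pd.read_csv(FI_PATH)

print(f"Feature importance table: {fi.shape[0]} rows x {fi.shape[1]} columns")
print(fi.head(10).to_string(index=False))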

Statistical Summary of Predictions#

The standard_subset_weighted_pred_prob column contains continuous EMS scores representing the probability that each variant has functional regulatory impact on gene expression.

score_col = 'standard_subset_weighted_pred_prob'

# Quantitative distribution analysis
print("SCORE DISTRIBUTION ANALYSIS")
print("=" * 50)
print(f"Total variants analyzed: {len(results)}")
print(f"\nPriority Classification:")
high = len(results[results[score_col] > 0.8])
medium = len(results[(results[score_col] > 0.5) & (results[score_col] <= 0.8)])
low = len(results[results[score_col] <= 0.5])
print(f"  High priority (>0.8):     {high:5d} ({high/len(results)*100:5.1f}%)")
print(f"  Medium priority (0.5-0.8): {medium:5d} ({medium/len(results)*100:5.1f}%)")
print(f"  Low priority (<=0.5):     {low:5d} ({low/len(results)*100:5.1f}%)")

# Descriptive statistics
print(f"\nScore Statistics:")
print(f"  Mean:   {results[score_col].mean():.4f}")
print(f"  Median: {results[score_col].median():.4f}")
print(f"  Std:    {results[score_col].std():.4f}")
print(f"  Min:    {results[score_col].min():.4f}")
print(f"  Max:    {results[score_col].max():.4f}")

# Percentile distribution
print(f"\nPercentile Distribution:")
for p in [90, 95, 99]:
    val = np.percentile(results[score_col], p)
    count = len(results[results[score_col] >= val])
    print(f"  {p}th percentile: {val:.4f} ({count} variants)")

Example output:

SCORE DISTRIBUTION ANALYSIS
==================================================
Total variants analyzed: 761

Priority Classification:
  High priority (>0.8):        12 (  1.6%)
  Medium priority (0.5-0.8):   45 (  5.9%)
  Low priority (<=0.5):       704 ( 92.5%)

Score Statistics:
  Mean:   0.1245
  Median: 0.0823
  Std:    0.1567
  Min:    0.0012
  Max:    0.9234

Percentile Distribution:
  90th percentile: 0.3421 (76 variants)
  95th percentile: 0.5012 (38 variants)
  99th percentile: 0.7834 (8 variants)
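A histogram makes the right-skew of the score distribution easier to see than summary statistics alone; a minimal matplotlib sketch using the results loaded above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(7, 4))
ax.hist(results[score_col], bins=50, edgecolor="black")
ax.axvline(0.5, color="orange", linestyle="--", label="medium-priority cutoff (0.5)")
ax.axvline(0.8, color="red", linestyle="--", label="high-priority cutoff (0.8)")
ax.set_xlabel("EMS functional score")
ax.set_ylabel("Variant count")
ax.set_title("Distribution of EMS scores (chr2)")
ax.legend()
plt.tight_layout()
plt.show()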

Variant Prioritization and Export#

Extract and rank high-confidence functional predictions for downstream experimental validation.

# Identify and sort high-priority variants
high_priority = results[results[score_col] > 0.8].sort_values(score_col, ascending=False)

print(f"HIGH-PRIORITY VARIANTS (Score >0.8)")
print("=" * 50)
print(f"Total: {len(high_priority)} variants")

if len(high_priority) > 0:
    print(f"\nTop 10 by EMS score:")
    display_cols = ['variant_id', score_col]
    if 'actual_label' in results.columns:
        display_cols.append('actual_label')
    print(high_priority[display_cols].head(10).to_string(index=False))
    
    # Export for experimental design
    output_file = "high_priority_variants_validation.tsv"
    high_priority.to_csv(output_file, sep='\t', index=False)
    print(f"\nExported to: {output_file}")
    print(f"   Contains all {len(high_priority)} high-priority variants")
    print(f"   Ready for: CRISPR screens, luciferase assays, functional validation")
else:
    print("\nNo variants exceed 0.8 threshold")
    print("   Recommended actions:")
    print("   1. Review medium-priority variants (0.5-0.8)")
    print("   2. Consider lowering threshold based on experimental capacity")
    print(f"   3. Top variant score: {results[score_col].max():.4f}")

Example output:

HIGH-PRIORITY VARIANTS (Score >0.8)
==================================================
Total: 12 variants

Top 10 by EMS score:
      variant_id  standard_subset_weighted_pred_prob  actual_label
   2:54321:A:G                              0.9234             1
   2:12345:T:C                              0.9102             1
   2:98765:C:A                              0.8956             1
   2:44444:G:T                              0.8723             0
   2:77777:A:C                              0.8612             1
   2:33333:T:G                              0.8501             1
   2:66666:C:T                              0.8398             1
   2:11111:G:A                              0.8267             0
   2:55555:A:T                              0.8145             1
   2:99999:T:A                              0.8034             1

Exported to: high_priority_variants_validation.tsv
   Contains all 12 high-priority variants
   Ready for: CRISPR screens, luciferase assays, functional validation
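For genome-browser review or assay design, it can also help to split variant_id back into coordinates. A minimal sketch that writes a BED-style file from the high_priority table above, assuming the chr:pos:ref:alt format described in Step 1 (the output file name is illustrative):

import pandas as pd

# Split "chr:pos:ref:alt" identifiers into columns (positions are 1-based)
coords = high_priority["variant_id"].str.split(":", expand=True)
coords.columns = ["chrom", "pos", "ref", "alt"]
coords["pos"] = coords["pos"].astype(int)

bed = pd.DataFrame({
    "chrom": coords["chrom"],
    "start": coords["pos"] - 1,  # BED intervals are 0-based, half-open
    "end": coords["pos"],
    "name": high_priority["variant_id"],
    "score": high_priority[score_col],
})
bed.to_csv("high_priority_variants.bed", sep="\t", header=False, index=False)
print(f"Wrote {len(bed)} intervals to high_priority_variants.bed")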

Verify Model Performance#

Review metrics from the held-out test set to understand model reliability.

# Load and display performance metrics
SUMMARY_PATH = "../../data/Mic_mega_eQTL/model_results/summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl"
with open(SUMMARY_PATH, 'rb') as f:
    summary = pickle.load(f)

print("MODEL PERFORMANCE ON TEST SET")
print("=" * 50)
metrics = summary['CatBoost']['standard_subset_weighted']
print(f"Average Precision (AP): {metrics['AP_test']:.4f}")
print(f"AUC-ROC:                {metrics['AUC_test']:.4f}")

print(f"\nTest Set Composition:")
print(f"  Positive labels (functional eQTLs): {summary['CatBoost']['test_num_positive_labels']}")
print(f"  Negative labels (non-functional):   {summary['CatBoost']['test_num_negative_labels']}")
pos_rate = summary['CatBoost']['test_num_positive_labels'] / (summary['CatBoost']['test_num_positive_labels'] + summary['CatBoost']['test_num_negative_labels'])
print(f"  Positive rate:                       {pos_rate:.1%}")

print("\nThese metrics reflect performance on 761 held-out chromosome 2 variants")
print("   with stricter selection criteria (PIP >0.9 for positives) than training data.")

Example output:

MODEL PERFORMANCE ON TEST SET
==================================================
Average Precision (AP): 0.5050
AUC-ROC:                0.8978

Test Set Composition:
  Positive labels (functional eQTLs): 68
  Negative labels (non-functional):   693
  Positive rate:                       8.9%

These metrics reflect performance on 761 held-out chromosome 2 variants
   with stricter selection criteria (PIP >0.9 for positives) than training data.
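When the actual_label column is present, the pickled metrics can be cross-checked directly from the predictions table; a short sketch using scikit-learn:

from sklearn.metrics import average_precision_score, roc_auc_score

# Recompute test-set metrics from per-variant predictions as a sanity check
if "actual_label" in results.columns:
    labeled = results.dropna(subset=["actual_label"])
    print(f"Recomputed AP:  {average_precision_score(labeled['actual_label'], labeled[score_col]):.4f}")
    print(f"Recomputed AUC: {roc_auc_score(labeled['actual_label'], labeled[score_col]):.4f}")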

Application to New Variant Lists#

Workflow for Novel Variants#

To score additional variants not included in the original training/test sets:

1. Prepare an input file matching the format above (chr:pos:ref:alt)

2. Update configuration (data_config.yaml):

  • Point training_data.base_dir to the directory containing your variant annotations

  • Ensure gene constraint file (GeneBayes scores) is accessible

  • Verify MAF file matches your chromosome
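If you prefer to script the configuration change, a minimal PyYAML sketch (the annotation directory is a placeholder; training_data.base_dir is the key named above):

import yaml  # PyYAML

with open("data_config.yaml") as f:
    cfg = yaml.safe_load(f)
cfg["training_data"]["base_dir"] = "/path/to/your/variant_annotations"  # placeholder
with open("data_config.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)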

3. Execute pipeline (same command, different input data):

python model_training_model5_only.py [cohort] [chromosome] \
  --data_config data_config.yaml \
  --model_config model_config.yaml

The pipeline will:

  • Generate all 4,839 genomic features for your variants

  • Apply the same preprocessing (subsetting, absolute values, imputation)

  • Use the trained model for inference

  • Output predictions in identical format

4. Analyze results using the code blocks above
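Alternatively, if you have already assembled an annotated feature matrix with the same 4,839 columns (identical names, order, and preprocessing as training), the serialized model can score it directly without rerunning the pipeline. A minimal sketch, where X_new is a hypothetical pre-annotated DataFrame indexed by variant_id and the parquet source is illustrative:

import joblib
import pandas as pd

model = joblib.load(MODEL_PATH)  # serialized CatBoost model loaded earlier

# X_new: hypothetical DataFrame, one row per variant, indexed by variant_id,
# with exactly the feature columns (and preprocessing) used in training
X_new = pd.read_parquet("my_annotated_variants.parquet")  # illustrative source

scores = model.predict_proba(X_new)[:, 1]  # probability of the functional class
out = pd.DataFrame({"variant_id": X_new.index, "ems_score": scores})
out.to_csv("custom_predictions.tsv", sep="\t", index=False)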