scEEMS Prediction

scEEMS Prediction#

Description#

Pipeline Execution Workflow#

Step 1: Input Data Preparation#

Create a tab-separated file with variant identifiers:

variant_id
12345:A:T
67890:G:C
11111:T:A

Format requirements:

Chromosome notation: Numeric (e.g., 2) or with prefix (e.g., chr2)
Position: 1-based genomic coordinate (GRCh38/hg38 reference)
Alleles: Reference and alternate alleles (ACGT notation)

Step 2: Execute Prediction Pipeline#

cd ~/xqtl-protocol/code/xqtl_modifier_score/
python model_training_model5_only.py Mic_mega_eQTL 2 \
  --data_config data_config.yaml \
  --model_config model_config.yaml

What happens during execution:

Variant parsing and coordinate validation
Feature annotation (distance, regulatory, population genetics, conservation, deep learning predictions)
Gene constraint and MAF integration with imputation using training statistics
Feature subsetting and absolute value transformations
Model inference with 10x weighting for regulatory features
Probability calibration and output generation

Step 3: Output Files Generated#

File	Description	Use Case
`model_standard_subset_weighted_chr_chr2_NPR_10.joblib`	Serialized CatBoost model	Future predictions, model inspection
`predictions_weighted_model_chr2.tsv`	Per-variant predictions with scores	Primary analysis, variant prioritization
`summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl`	AP/AUC metrics, class distributions	Model validation, quality control
`features_importance_model5_chr_chr2_NPR_10.csv`	Feature importance rankings	Biological interpretation

Load and Inspect Results#

import pandas as pd
import joblib
import pickle
import numpy as np

# Load trained CatBoost model
MODEL_PATH = "../../data/Mic_mega_eQTL/model_results/model_standard_subset_weighted_chr_chr2_NPR_10.joblib"
model = joblib.load(MODEL_PATH)

print("Model Configuration:")
print(f"  Algorithm: {model.__class__.__name__}")
print(f"  Features: {model.feature_count_}")
print(f"  Classes: {list(model.classes_)}")
print(f"  Tree depth: {model.get_params()['depth']}")
print(f"  Iterations: {model.tree_count_}")

# Load prediction results
RESULTS_PATH = "../../data/Mic_mega_eQTL/model_results/predictions_parquet_catboost/predictions_weighted_model_chr2.tsv"
results = pd.read_csv(RESULTS_PATH, sep='\t')

print(f"\nLoaded predictions for {len(results)} variants")
print(f"\nAvailable columns:")
print(f"  - variant_id: Genomic coordinates")
print(f"  - standard_subset_weighted_pred_prob: EMS functional score (0-1)")
print(f"  - standard_subset_weighted_pred_label: Binary classification (0/1)")
print(f"  - actual_label: True label from test set (if available)")

Example output:

Model Configuration:
  Algorithm: CatBoostClassifier
  Features: 4839
  Classes: [0, 1]
  Tree depth: 6
  Iterations: 1000

Loaded predictions for 761 variants

Available columns:
  - variant_id: Genomic coordinates
  - standard_subset_weighted_pred_prob: EMS functional score (0-1)
  - standard_subset_weighted_pred_label: Binary classification (0/1)
  - actual_label: True label from test set (if available)

Statistical Summary of Predictions#

The standard_subset_weighted_pred_prob column contains continuous EMS scores representing the probability that each variant has functional regulatory impact on gene expression.

score_col = 'standard_subset_weighted_pred_prob'

# Quantitative distribution analysis
print("SCORE DISTRIBUTION ANALYSIS")
print("=" * 50)
print(f"Total variants analyzed: {len(results)}")
print(f"\nPriority Classification:")
high = len(results[results[score_col] > 0.8])
medium = len(results[(results[score_col] > 0.5) & (results[score_col] <= 0.8)])
low = len(results[results[score_col] <= 0.5])
print(f"  High priority (>0.8):     {high:5d} ({high/len(results)*100:5.1f}%)")
print(f"  Medium priority (0.5-0.8): {medium:5d} ({medium/len(results)*100:5.1f}%)")
print(f"  Low priority (<0.5):      {low:5d} ({low/len(results)*100:5.1f}%)")

# Descriptive statistics
print(f"\nScore Statistics:")
print(f"  Mean:   {results[score_col].mean():.4f}")
print(f"  Median: {results[score_col].median():.4f}")
print(f"  Std:    {results[score_col].std():.4f}")
print(f"  Min:    {results[score_col].min():.4f}")
print(f"  Max:    {results[score_col].max():.4f}")

# Percentile distribution
print(f"\nPercentile Distribution:")
for p in [90, 95, 99]:
    val = np.percentile(results[score_col], p)
    count = len(results[results[score_col] >= val])
    print(f"  {p}th percentile: {val:.4f} ({count} variants)")

Example output:

SCORE DISTRIBUTION ANALYSIS
==================================================
Total variants analyzed: 761

Priority Classification:
  High priority (>0.8):        12 (  1.6%)
  Medium priority (0.5-0.8):   45 (  5.9%)
  Low priority (<0.5):        704 ( 92.5%)

Score Statistics:
  Mean:   0.1245
  Median: 0.0823
  Std:    0.1567
  Min:    0.0012
  Max:    0.9234

Percentile Distribution:
  90th percentile: 0.3421 (76 variants)
  95th percentile: 0.5012 (38 variants)
  99th percentile: 0.7834 (8 variants)

Variant Prioritization and Export#

Extract and rank high-confidence functional predictions for downstream experimental validation.

# Identify and sort high-priority variants
high_priority = results[results[score_col] > 0.8].sort_values(score_col, ascending=False)

print(f"HIGH-PRIORITY VARIANTS (Score >0.8)")
print("=" * 50)
print(f"Total: {len(high_priority)} variants")

if len(high_priority) > 0:
    print(f"\nTop 10 by EMS score:")
    display_cols = ['variant_id', score_col]
    if 'actual_label' in results.columns:
        display_cols.append('actual_label')
    print(high_priority[display_cols].head(10).to_string(index=False))
    
    # Export for experimental design
    output_file = "high_priority_variants_validation.tsv"
    high_priority.to_csv(output_file, sep='\t', index=False)
    print(f"\nExported to: {output_file}")
    print(f"   Contains all {len(high_priority)} high-priority variants")
    print(f"   Ready for: CRISPR screens, luciferase assays, functional validation")
else:
    print("\nNo variants exceed 0.8 threshold")
    print("   Recommended actions:")
    print("   1. Review medium-priority variants (0.5-0.8)")
    print("   2. Consider lowering threshold based on experimental capacity")
    print(f"   3. Top variant score: {results[score_col].max():.4f}")

Example output:

HIGH-PRIORITY VARIANTS (Score >0.8)
==================================================
Total: 12 variants

Top 10 by EMS score:
      variant_id  standard_subset_weighted_pred_prob  actual_label
   2:54321:A:G                              0.9234             1
   2:12345:T:C                              0.9102             1
   2:98765:C:A                              0.8956             1
   2:44444:G:T                              0.8723             0
   2:77777:A:C                              0.8612             1
   2:33333:T:G                              0.8501             1
   2:66666:C:T                              0.8398             1
   2:11111:G:A                              0.8267             0
   2:55555:A:T                              0.8145             1
   2:99999:T:A                              0.8034             1

  Exported to: high_priority_variants_validation.tsv
   Contains all 12 high-priority variants
   Ready for: CRISPR screens, luciferase assays, functional validation

Verify Model Performance#

Review metrics from the held-out test set to understand model reliability.

# Load and display performance metrics
SUMMARY_PATH = "../../data/Mic_mega_eQTL/model_results/summary_dict_catboost_weighted_model_chr_chr2_NPR_10.pkl"
with open(SUMMARY_PATH, 'rb') as f:
    summary = pickle.load(f)

print("MODEL PERFORMANCE ON TEST SET")
print("=" * 50)
metrics = summary['CatBoost']['standard_subset_weighted']
print(f"Average Precision (AP): {metrics['AP_test']:.4f}")
print(f"AUC-ROC:                {metrics['AUC_test']:.4f}")

print(f"\nTest Set Composition:")
print(f"  Positive labels (functional eQTLs): {summary['CatBoost']['test_num_positive_labels']}")
print(f"  Negative labels (non-functional):   {summary['CatBoost']['test_num_negative_labels']}")
pos_rate = summary['CatBoost']['test_num_positive_labels'] / (summary['CatBoost']['test_num_positive_labels'] + summary['CatBoost']['test_num_negative_labels'])
print(f"  Positive rate:                       {pos_rate:.1%}")

print("\n These metrics reflect performance on 761 held-out chromosome 2 variants")
print("   with stricter selection criteria (PIP >0.9 for positives) than training data.")

Example output:

MODEL PERFORMANCE ON TEST SET
==================================================
Average Precision (AP): 0.5050
AUC-ROC:                0.8978

Test Set Composition:
  Positive labels (functional eQTLs): 68
  Negative labels (non-functional):   693
  Positive rate:                       8.9%

   These metrics reflect performance on 761 held-out chromosome 2 variants
   with stricter selection criteria (PIP >0.9 for positives) than training data.

Application to New Variant Lists#

Workflow for Novel Variants#

To score additional variants not included in the original training/test sets:

1. Prepare input file matching the format above (chr:pos:ref:alt)

2. Update configuration (data_config.yaml):

Point training_data.base_dir to directory containing your variant annotations
Ensure gene constraint file (GeneBayes scores) is accessible
Verify MAF file matches your chromosome

3. Execute pipeline (same command, different input data):

python model_training_model5_only.py [cohort] [chromosome] \
  --data_config data_config.yaml \
  --model_config model_config.yaml

The pipeline will:

Generate all 4,839 genomic features for your variants
Apply the same preprocessing (subsetting, absolute values, imputation)
Use the trained model for inference
Output predictions in identical format

4. Analyze results using the code blocks above

Steps#

Step 1. Score variants with a trained feature-weighted CatBoost scEEMS model for one cohort / chromosome. The toy command below scores the microglia cohort (protocol_example) on chromosome 2 using a model produced by the training workflow.

Timing: ~varies on typical compute infrastructure.

sos run pipeline/ems_prediction.ipynb predict \
    --cohort protocol_example \
    --chromosome 2 \
    --model_path output/ems_training/protocol_example_chr2_scEEMS_model.joblib \
    --data_config code/xqtl_modifier_score/data_config.yaml \
    --cwd output/ems_prediction

Command interface#

sos run pipeline/ems_prediction.ipynb -h

Workflow implementation#

The predict step wraps the gems_pipeline.py predict engine. The pipeline loads the trained model, annotates the input variants with the full feature set, applies the 10x deep-learning feature weighting used at training time, and writes per-variant EMS scores. New datasets / cell types are scored by editing the data_config YAML only - no Python code changes are required.

[global]
import os
# Work directory & output directory
parameter: cwd = path('output/ems_prediction')
# Cohort / cell type to score (must match an entry in the data config)
parameter: cohort = 'protocol_example'
# Chromosome to score
parameter: chromosome = '2'
# Trained CatBoost model (.joblib) from the EMS Training workflow
parameter: model_path = path('output/ems_training/protocol_example_chr2_scEEMS_model.joblib')
# Data configuration YAML (cohort, variant list, feature set)
parameter: data_config = path('code/xqtl_modifier_score/data_config.yaml')
# Directory holding gems_pipeline.py
parameter: pipeline_dir = path('code/xqtl_modifier_score')
parameter: job_size = 1
parameter: mem = '60G'
parameter: walltime = '24h'

[predict]
output: f'{pipeline_dir:a}/' + __import__('yaml').safe_load(open(data_config))['output']['base_dir'].format(cohort=cohort) + f"/{__import__('yaml').safe_load(open(data_config))['output']['predictions_dir']}/predictions_weighted_model_chr{chromosome}.tsv"
task: trunk_workers = 1, trunk_size = job_size, mem = mem, walltime = walltime, tags = f'{step_name}_{cohort}_chr{chromosome}'
bash: expand = "$[ ]", workdir = pipeline_dir, stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout'
    python gems_pipeline.py predict $[cohort] $[chromosome] \
        --model_path $[model_path:a] \
        --data_config $[data_config:a]

Anticipated Results#

Running the predict step produces per-variant scEEMS prediction scores for the supplied variant list. See the Pipeline Execution Workflow walkthrough above for the statistical summary, variant prioritization, and model-performance verification of these outputs.

scEEMS Prediction

Contents

scEEMS Prediction#

Description#

Pipeline Execution Workflow#

Step 1: Input Data Preparation#

Step 2: Execute Prediction Pipeline#

Step 3: Output Files Generated#

Load and Inspect Results#

Statistical Summary of Predictions#

Variant Prioritization and Export#

Verify Model Performance#

Application to New Variant Lists#

Workflow for Novel Variants#

Steps#

Command interface#

Workflow implementation#

Anticipated Results#