Phenotype data imputation#
Description#
Empirical Bayes Matrix Factorization (EBMF) is the primary imputation method we use to impute molecular phenotype data [cf. Qi et al., medRxiv, 2023]. We also provide a collection of other methods to impute missing omics data values including missing forest, XGboost, k-nearest neighbors, soft impute, mean imputation, and limit of detection.
Timing: ~varies on typical compute infrastructure.
Input Files#
File |
Description |
|---|---|
|
Toy molecular phenotype matrix in BED format (first four columns |
Output Files#
File |
Description |
|---|---|
|
Imputed molecular phenotype matrix in the same BED layout, bgzipped and tabix-indexed. |
|
Saved flashier factor object from the gEBMF fit (written only when |
Steps#
The pipeline offers several imputation methods. Any one of them can be run on the toy file below; each writes an imputed BED next to the --cwd output directory. Two practical notes for the toy data:
--num_factorfor the EBMF / gEBMF factor models must be smaller than the number of samples (the toy set has 60 samples), so the examples use--num_factor 30.Leave quality control enabled (do not pass
--no-qc-prior-to-impute) so that the QC matrix used for the output is available to every method.
gEBMF (grouped Empirical Bayes Matrix Factorization)#
Grouped Empirical Bayes Matrix Factorization. Extends EBMF by fitting factors within row groups (here, by chromosome), so structure shared across molecular features in the same group is borrowed when filling in missing values. This is the recommended default for the protocol.
sos run pipeline/phenotype_imputation.ipynb gEBMF \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf \
--num_factor 30
EBMF (Empirical Bayes Matrix Factorization)#
Empirical Bayes Matrix Factorization. Decomposes the phenotype matrix into a small number of latent factors with adaptive shrinkage priors, then reconstructs the matrix to impute missing entries. Captures global low-rank structure across all samples and features.
sos run pipeline/phenotype_imputation.ipynb EBMF \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf \
--num_factor 30
missForest#
A non-parametric, iterative random-forest imputation. Each feature with missing values is predicted in turn from the other features using a random forest, repeating until the imputed values stabilize. Handles non-linear relationships but is computationally heavier.
sos run pipeline/phenotype_imputation.ipynb missforest \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
missXGBoost#
Iterative imputation using gradient-boosted trees (XGBoost) as the per-feature predictor instead of random forests. Often faster and more accurate than missForest on large matrices while still modeling non-linear dependencies between features.
sos run pipeline/phenotype_imputation.ipynb missxgboost \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
KNN (k-nearest-neighbours)#
k-Nearest-Neighbours imputation. A missing entry is filled with a (distance-weighted) average of the corresponding values from the k most similar samples. Simple and fast, and works well when samples cluster into similar profiles.
sos run pipeline/phenotype_imputation.ipynb knn \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
SoftImpute#
Matrix-completion via iterative soft-thresholded singular value decomposition. Assumes the phenotype matrix is approximately low-rank and recovers missing entries by repeatedly shrinking the singular values. A good linear baseline for structured data.
sos run pipeline/phenotype_imputation.ipynb soft \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
Mean imputation#
Mean imputation. Each missing value is replaced by the mean of the observed values for that feature. The simplest baseline; fast but ignores correlations between samples and features.
sos run pipeline/phenotype_imputation.ipynb mean \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
LOD (limit-of-detection) imputation#
Limit-of-detection imputation. Missing values are replaced with a low constant derived from the smallest observed values, appropriate when missingness is driven by measurements falling below an assay’s detection threshold (e.g., low-abundance proteins/metabolites).
sos run pipeline/phenotype_imputation.ipynb lod \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf
Command Interface#
sos run phenotype_imputation.ipynb -h
usage: sos run phenotype_imputation.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
EBMF
gEBMF
missforest
missxgboost
knn
soft
mean
lod
bed_filter_na
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--phenoFile VAL (as path, required)
Molecular phenotype matrix
--[no-]qc-prior-to-impute (default to True)
QC before imputation to remove unqualified features
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 72h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 20 (as int)
Number of threads
--container ''
Sections
EBMF:
Workflow Options:
--prior 'ebnm_point_laplace'
prior distribution of loadings and factors
--varType 1
--num-factor 60
gEBMF:
Workflow Options:
--nCores 1
--num-factor 60
--backfit-iter 3
--[no-]save-flash (default to True)
--[no-]null-check (default to False)
missforest:
missxgboost:
knn:
soft:
mean:
lod:
bed_filter_na:
Workflow Options:
--rank-max 50 (as int)
--lambda-hyp 30 (as int)
--impute-method soft
--tol-missing 0.05 (as float)
Tolerance of missingness rows with missing rate larger than tol_missing will be removed, with missing rate smaller than tol_missing will be mean_imputed. Say if
we want to keep rows with less than 5% missing, then we use 0.05 as tol_missing.
Setup and global parameters#
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# QC before imputation to remove unqualified features
parameter: qc_prior_to_impute = True
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "72h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ""
[EBMF]
# prior distribution of loadings and factors
parameter: prior = "ebnm_point_laplace"
parameter: varType = '1'
parameter: num_factor = '60'
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step EBMF \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--prior "${prior}" \
--varType "${varType}" \
--num-factor ${num_factor} \
--numThreads ${numThreads}
[gEBMF]
parameter: nCores = '1'
parameter: num_factor = '60'
parameter: backfit_iter = '3'
parameter: save_flash = True
parameter: null_check = False
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}', cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step gEBMF \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--num-factor ${num_factor} \
--nCores ${nCores} \
--backfit-iter ${backfit_iter} \
--save-flash ${"TRUE" if save_flash else "FALSE"} \
--null-check ${"TRUE" if null_check else "FALSE"} \
--numThreads ${numThreads}
[missforest]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step missforest \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[missxgboost]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.sh missxgboost \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[knn]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step knn \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[soft]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step soft \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[mean]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step mean \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[lod]
input: phenoFile
output: f'{cwd:a}/{_input:bn}.imputed.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step lod \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--qc-prior-to-impute ${"TRUE" if qc_prior_to_impute else "FALSE"} \
--numThreads ${numThreads}
[bed_filter_na]
parameter: rank_max = 50 # max rank estimated in the per-chr methyl matrix
parameter: lambda_hyp = 30 # hyper par, indicating the importance of the nuclear norm
parameter: impute_method = "soft"
# Tolerance of missingness rows with missing rate larger than tol_missing will be removed,
# with missing rate smaller than tol_missing will be mean_imputed. Say if we want to keep rows with less than 5% missing, then we use 0.05 as tol_missing.
parameter: tol_missing = 0.05
input: phenoFile
output: f'{cwd:a}/{_input:nn}.filter_na.{impute_method}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_imputation.R \
--step bed_filter_na \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--output "${_output}" \
--rank-max ${rank_max} \
--lambda-hyp ${lambda_hyp} \
--impute-method "${impute_method}" \
--tol-missing ${tol_missing} \
--numThreads ${numThreads}
The resource usage for softimputing 450K methylation data are as followed:
time elapsed: 880.90s
peak first occurred: 152.11s
peak last occurred: 175.41s
max vms_memory: 38.95GB
max rss_memory: 34.35GB
memory check interval: 1s
return code: 0
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.