Hidden Factor Analysis#
Description#
This step infers hidden factors (latent technical/biological covariates) from a molecular phenotype matrix. Including these factors as covariates in xQTL mapping improves power and controls for unwanted variation.
The pipeline provides several procedures:
PEER – Probabilistic Estimation of Expression Residuals (Stegle et al. 2010), the approach used by GTEx.
PCA (Marchenko-Pastur) – the approach used for our main analyses: residualize the phenotype on known covariates, run PCA, and keep the number of PCs above the Marchenko-Pastur noise threshold.
PCA (Buja & Eyuboglu permutation) – an alternative that chooses the number of PCs by permutation (
jackstraw).BiCV factor analysis with APEX – Bi-cross-validation using the APEX tool.
The minimal working example below runs the PCA (Marchenko-Pastur) method, since it runs fully on the toy data with no external container or network access.
Input Files#
File |
Description |
|---|---|
|
Molecular phenotype matrix in indexed |
|
Merged known-covariate matrix ( |
Notes: the covariate file is optional for PEER but recommended so a proper xQTL model is built. The BiCV/APEX method additionally requires an indexed VCF.
Output Files#
File |
Description |
|---|---|
|
Combined covariate matrix: known covariates + inferred |
|
Intermediate residualized phenotype matrix used as input to PCA. |
For PEER, additional diagnostic outputs (precision, residuals, weights, convergence plots) are produced.
PCA with Marchenko-Pastur (recommended)#
This is the method used for our main analyses. The Marchenko_PC workflow first regresses the molecular phenotype on the known covariates to get residuals, then runs PCA on the residual matrix and keeps the number of principal components whose eigenvalues exceed the Marchenko-Pastur random-matrix noise threshold. The retained components are written out as Hidden_Factor_PC* rows, stacked together with the known covariates so the result is ready to use directly in xQTL mapping. Use --mean-impute-missing to mean-impute any missing phenotype values.
Timing: <1 min.
Steps#
sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
--cwd output/covariate \
--phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
--covFile output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
--mean-impute-missing
PEER (GTEx-style)#
PEER infers hidden factors using a MOFA-based probabilistic model (Stegle et al.). The number of factors follows GTEx recommendations based on sample size, or can be fixed with --N. This method runs natively here (it uses the mofapy2 Python package), so the example below is fully runnable on the toy data and produces the factor matrix plus a convergence-diagnostic PDF.
Timing: ~1 min.
sos run pipeline/covariate_hidden_factor.ipynb PEER \
--cwd output/covariate \
--phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
--covFile output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
--N 3
PCA with Buja & Eyuboglu permutation#
This is the same PCA workflow as above (PCA), but the number of components to keep is chosen by the Buja & Eyuboglu permutation procedure instead of the Marchenko-Pastur threshold. Set --choose_k_method Buja_Eyuboglu. This route uses jackstraw::permutationPA (B=100 permutations), so it runs a little longer than the Marchenko variant but is fully runnable on the toy data.
Timing: ~1 min.
sos run pipeline/covariate_hidden_factor.ipynb PCA \
--cwd output/covariate \
--phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
--covFile output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
--choose_k_method Marchenko \
--mean-impute-missing
BiCV factor analysis with APEX#
BiCV chooses the number of hidden factors by bi-cross-validation using the APEX software (Owen et al. 2016). The pipeline computes residuals on the merged covariates, builds the inputs APEX expects (an indexed VCF placeholder and a phenotype matrix), then calls apex factor. The number of factors follows GTEx recommendations or can be fixed with --N.
Timing: ~30 s.
sos run pipeline/covariate_hidden_factor.ipynb BiCV \
--cwd output/covariate \
--phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
--covFile output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
--N 3
Command Interface#
sos run covariate_hidden_factor.ipynb -h
usage: sos run covariate_hidden_factor.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
Marchenko_PC
PEER
BiCV
Global Workflow Options:
--cwd output (as path)
The output directory for generated files. MUST BE FULL
PATH
--covFile VAL (as path, required)
Merged Covariates File
--phenoFile VAL (as path, required)
Path to the input molecular phenotype data.
--name f'{phenoFile:bnn}.{covFile:bn}'
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 8 (as int)
Number of threads
--container ''
Software container option
Sections
*_1:
Workflow Options:
--[no-]mean-impute-missing (default to False)
Marchenko_PC_2:
PEER_2:
Workflow Options:
--N 0 (as int)
N PEER factors, If do not specify or specified as 0,
default values suggested by GTEx (based on different
sample size) will be used
--iteration 1000 (as int)
Default values from PEER software: The number of max
iteration
--tol 0.001 (as float)
Prior parameters parameter: Alpha_a = 0.001 parameter:
Alpha_b = 0.1 parameter: Eps_a = 0.1 parameter: Eps_b =
10.0 Tolarance parameters
--[no-]r2-tol (default to False)
parameter: var_tol = 0.00001 minimum variance explained
criteria to drop factors while training
--convergence-mode fast
Convergence mode: Convergence mode for MOFAr "slow",
"medium" or "fast", corresponding to 1e-5%, 1e-4% or
1e-3% deltaELBO change.
PEER_3:
BiCV_2:
BiCV_3:
Workflow Options:
--N 0 (as int)
N factors, if not specify, calculated based on sample
size according to GTeX
--iteration 10 (as int)
The number of iteration
Setup and global parameters#
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# The output directory for generated files. MUST BE FULL PATH
parameter: cwd = path("output")
# Merged Covariates File
parameter: covFile = path
# Path to the input molecular phenotype data.
parameter: phenoFile = path
parameter: name = f'{phenoFile:bnn}.{covFile:bn}'
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
parameter: entrypoint= ""
[*_1(computing residual on merged covariates)]
parameter: mean_impute_missing = False
input: phenoFile, covFile
output: f'{cwd}/{name}.residual.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/covariate/covariate_hidden_factor.R \
--step compute_residual \
--cwd "${cwd}" \
--phenoFile "${phenoFile}" \
--covFile "${covFile}" \
--output "${_output}" \
${"--mean-impute-missing" if mean_impute_missing else ""} \
--numThreads ${numThreads}
Principal Components Analysis on molecular phenotype matrix#
[Marchenko_PC_2, PCA_2]
# Marchenko or Marchenko
parameter: choose_k_method = "Marchenko"
parameter: N = 0
output: f'{cwd}/{_input:bnnn}.{choose_k_method}_PC.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/covariate/covariate_hidden_factor.R \
--step Marchenko_PC \
--cwd "${cwd}" \
--residFile "${_input}" \
--covFile "${covFile}" \
--choose-k-method "${choose_k_method}" \
--output "${_output}" \
--N ${N} \
--numThreads ${numThreads}
PEER Method#
[PEER_2]
# N PEER factors, If do not specify or specified as 0, default values suggested by
# GTEx (based on different sample size) will be used
parameter: N = 0
# Default values from PEER software:
## The number of max iteration
parameter: iteration = 1000
### Prior parameters
#parameter: Alpha_a = 0.001
#parameter: Alpha_b = 0.1
#parameter: Eps_a = 0.1
#parameter: Eps_b = 10.0
# Tolarance parameters
parameter: tol = 0.001
#parameter: var_tol = 0.00001
# minimum variance explained criteria to drop factors while training
parameter: r2_tol = False
# Convergence mode: Convergence mode for MOFAr "slow", "medium" or "fast", corresponding to 1e-5%, 1e-4% or 1e-3% deltaELBO change.
parameter: convergence_mode = "fast"
# input is the residual file from the first step
output: f'{cwd:a}/{_input:bnn}.PEER_MODEL.hd5'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/covariate/covariate_hidden_factor.R \
--step PEER_fit \
--cwd "${cwd}" \
--residFile "${_input}" \
--N ${N} \
--iteration ${iteration} \
--convergence-mode "${convergence_mode}" \
--tol ${tol} \
--r2-tol ${r2_tol} \
--numThreads ${numThreads}
[PEER_3]
output: f'{cwd:a}/{_input:bnn}.PEER.gz', f'{cwd:a}/{_input:bnn}.PEER.diag.pdf'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/covariate/covariate_hidden_factor.R \
--step PEER_extract \
--cwd "${cwd}" \
--modelFile "${_input}" \
--covFile "${covFile}" \
--numThreads ${numThreads}
Reference#
BiCV Factor Analysis with APEX#
Factor anallysis using Bi-Cross validation with the APEX software package [cf. Owen et al., Statistical Science, 2016] [cf. Quick et al., bioRxiv, 2020]
Notice that the command options are different from those on the APEX website documentation. The commands on the documentation page does not work (last updated September 2021). The commands below were constructed and tested by our team based on our understanding of the program, without input from APEX authors.
[BiCV_2]
# For cluster jobs, number commands to run per job
import time
output: f'{cwd:a}/{_input:bn}.fake.vcf.gz'
#task: trunk_workers = 1,trunk_size = job_size , walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', entrypoint = entrypoint
library("dplyr")
library("readr")
## Add fake header
cat(paste("##fileformat=VCFv4.2\n##fileDate=$[time.strftime("%Y%m%d",time.localtime())]\n##source=FAKE\n"), file=$[_output:nr], append=FALSE)
## Add colnames based on bed
pheno = read_delim("$[_input]", delim = "\t",n_max = 1)
colnames(pheno)[1:3] = c("#CHROM","POS","ID")
pheno = cbind(pheno[1:3]%>%mutate(REF = "A", ALT = "C", QUAL = ".",FILTER = ".", INFO = "PR", FORMAT = "GT"),pheno[,5:ncol(pheno)])
pheno%>%write_delim($[_output:nr],delim = "\t",col_names = T, append = T)
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', entrypoint = entrypoint
bgzip -f $[_output:n]
tabix -p vcf $[_output]
[BiCV_3]
# N factors, if not specify, calculated based on sample size according to GTeX
parameter: N = 0
# The number of iteration
parameter: iteration = 10
import pandas as pd
data = pd.read_csv(phenoFile, sep=" ", index_col = 3).drop(["#chr","start","end"],axis = 1)
if N == 0:
if len(data.columns) < 150:
N = 15
elif len(data.columns) < 250:
N = 30
elif len(data.columns) < 350:
N = 45
else:
N = 60
input: output_from(1), output_from(2)
output: f'{cwd:a}/{_input[0]:bnn}.BiCV.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', entrypoint = entrypoint
apex factor \
--out $[_output[0]:nn] \
--iter $[iteration] \
--factors $[N] \
--bed $[_input[0]] \
--vcf $[_input[1]] \
--threads $[numThreads] $[ f'--cov {covFile}' if covFile.is_file() else '']
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.