Covariate Data Formatting#
Description#
This covariate preprocessing step merges genotype principal components (PCs) with the fixed covariates into a single file for downstream QTL analysis.
Input#
PCA file (RDS) produced by the genotype PCA module
Fixed covariate files
Output#
PCA + Covariate file
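As a rough illustration of what "PCA + Covariate file" means, the sketch below appends the first `k` PC scores to each sample's covariates. It is not the pipeline's R implementation: the function name `merge_covariates_with_pcs`, the dictionary layout, the `PC1`, `PC2`, ... column names, and the choice to keep only samples present in both inputs are all assumptions made for the example.

```python
def merge_covariates_with_pcs(covariates, pcs, k):
    """Append the first k PC scores to each sample's covariates.

    covariates: dict of sample ID -> dict of covariate name -> value (assumed layout)
    pcs:        dict of sample ID -> list of PC scores, PC1 first (assumed layout)
    Samples missing from either input are dropped (an assumption of this sketch).
    """
    merged = {}
    for sample_id, covs in covariates.items():
        if sample_id not in pcs:
            continue
        row = dict(covs)  # copy so the input is not modified
        for i, score in enumerate(pcs[sample_id][:k], start=1):
            row[f"PC{i}"] = score
        merged[sample_id] = row
    return merged


# Toy data, invented for illustration only.
covariates = {"s1": {"age": 70, "sex": 1}, "s2": {"age": 65, "sex": 0}}
pcs = {
    "s1": [0.12, -0.03, 0.40],
    "s2": [-0.08, 0.05, 0.10],
    "s3": [0.00, 0.00, 0.00],  # has PCs but no covariates, so it is dropped
}
merged = merge_covariates_with_pcs(covariates, pcs, k=2)
print(merged["s1"])
```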
Minimal Working Example Steps#
The data and Singularity container used in this minimal working example can be found on Synapse.
i. Merge Covariates and Genotype PCs#
Timing: <1 min
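The --k parameter sets the number of genotype PCs to retain. Rather than fixing it by hand, this example derives it from the scree file produced by the PCA module: the awk command keeps the rows whose third column (presumably the cumulative proportion of variance explained) is below 0.8, and the first field of the last such row is passed as --k. The retained PCs therefore correspond to a cutoff of 80% of the variation; change 0.8 to use a different cutoff. The --tol-cov 0.4 option removes samples whose covariate missing rate is larger than 0.4.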
sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
--cwd output/covariate/ \
--pcaFile output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.rds \
--covFile data/covariate/covariates.tsv \
--tol-cov 0.4 \
--k `awk '$3 < 0.8' output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1 `
INFO: Running merge_genotype_pc:
INFO: merge_genotype_pc is completed.
INFO: merge_genotype_pc output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz
INFO: Workflow merge_genotype_pc (ID=wca247f02ec8db517) is executed successfully with 1 completed step.
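To make the `--k` selection easier to follow, here is a minimal Python sketch of the same rule as the `awk | tail | cut` pipeline above. It assumes a tab-delimited scree file whose third column is numeric; the header names and values in the toy input are invented, and `choose_k` is a hypothetical helper, not part of the pipeline.

```python
def choose_k(scree_text, threshold=0.8):
    """Return the first field of the last row whose third column is below threshold.

    Mirrors: awk '$3 < threshold' scree.txt | tail -1 | cut -f 1
    Rows with fewer than three fields or a non-numeric third column
    (for example a header) are skipped.
    """
    selected = None
    for line in scree_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        try:
            value = float(fields[2])
        except ValueError:
            continue  # non-numeric third column, e.g. a header row
        if value < threshold:
            selected = fields[0]  # keep overwriting, so the last match wins
    return selected


# Hypothetical scree table: PC index, PVE, cumulative PVE.
scree_text = "\n".join([
    "PC\tPVE\tcum_PVE",
    "1\t0.40\t0.40",
    "2\t0.25\t0.65",
    "3\t0.13\t0.78",
    "4\t0.07\t0.85",
    "5\t0.05\t0.90",
])
print(choose_k(scree_text))
```

Note that this rule selects the last PC that is still below the threshold, so in the toy table the retained PCs explain 78% rather than at least 80%.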
Troubleshooting#
| Step | Substep | Problem | Possible Reason | Solution |
|---|---|---|---|---|
Command Interface#
!sos run covariate_formatting.ipynb -h
usage: sos run covariate_formatting.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
merge_genotype_pc
Global Workflow Options:
--cwd output (as path)
The output directory for generated files.
--covFile VAL (as path, required)
The covariate file
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 2G
Memory expected
--numThreads 8 (as int)
Number of threads
--container ''
Software container option
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
Sections
merge_genotype_pc:
Workflow Options:
--pcaFile VAL (as path, required)
An RDS file as the output of the genotype PCA module
--k 20 (as int)
The number of PCs to retain, by default is 20, in
practice can be the number of PC that captured more than
70% PVE
--name f'{covFile:bn}.{pcaFile:bn}'
--outliersFile . (as path)
Outliers
--[no-]remove-outliers (default to False)
--tol-cov -1.0 (as float)
Tolerance of missingness in covariates, -1 means do
nothing, otherwise for samples with covariates missing
rate larger than tol_cov will be removed, with missing
rate smaller than tol_cov will be kept.
--[no-]mean-impute (default to True)
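The `--tol-cov` and `--mean-impute` options described in the help text above can be pictured with the following sketch. It follows the help text only: the R script itself is not shown here, so the data layout (one list of covariate values per sample, `None` for missing), the handling of a missing rate exactly equal to `tol_cov` (kept), and the helper name `filter_and_impute` are assumptions.

```python
def filter_and_impute(samples, tol_cov=-1.0, mean_impute=True):
    """Drop samples with too many missing covariates, then mean-impute the rest.

    samples: dict of sample ID -> non-empty list of covariate values, None = missing.
    tol_cov: negative means no filtering; otherwise samples whose missing
             rate is larger than tol_cov are removed.
    """
    kept = {}
    for sample_id, values in samples.items():
        missing_rate = sum(v is None for v in values) / len(values)
        if tol_cov < 0 or missing_rate <= tol_cov:
            kept[sample_id] = list(values)  # copy so the input is not modified

    if mean_impute and kept:
        n_covariates = len(next(iter(kept.values())))
        for j in range(n_covariates):
            observed = [v[j] for v in kept.values() if v[j] is not None]
            if not observed:
                continue  # nothing to average for this covariate
            mean_j = sum(observed) / len(observed)
            for values in kept.values():
                if values[j] is None:
                    values[j] = mean_j
    return kept


# Toy data, invented for illustration only.
samples = {
    "s1": [70, 1, None, 2.0],       # missing rate 0.25
    "s2": [None, None, None, 1.0],  # missing rate 0.75
    "s3": [60, 0, 3.0, 4.0],        # missing rate 0.0
}
result = filter_and_impute(samples, tol_cov=0.4)
print(sorted(result))
```

With `tol_cov=0.4`, sample `s2` is removed and the missing value of `s1` is filled with the mean of the remaining observed values for that covariate.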
Setup and global parameters#
[global]
parameter: renovated_code_dir = path('renovated_code/script') # override with --renovated-code-dir
# The output directory for generated files.
parameter: cwd = path("output")
# The covariate file
parameter: covFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "2G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
parameter: entrypoint=""
cwd = path(f"{cwd:a}")
Step 0: Merge Covariates and Genotype PCs#
[merge_genotype_pc]
# An RDS file as the output of the genotype PCA module
parameter: pcaFile = path
# The number of PCs to retain, by default is 20, in practice can be the number of PC that captured more than 70% PVE
parameter: k = 20
parameter: name = f'{covFile:bn}.{pcaFile:bn}'
# Outliers
parameter: outliersFile = path(".")
parameter: remove_outliers = False
# Tolerance of missingness in covariates, -1 means do nothing, otherwise for samples with covariates missing rate larger than tol_cov will be removed,
# with missing rate smaller than tol_cov will be kept.
parameter: tol_cov = -1.0
parameter: mean_impute = True
stop_if(remove_outliers and not outliersFile.is_file(), msg = "No outliers file specified, please add outliers file or remove the remove-outliers flag")
input: pcaFile, covFile
output: f'{cwd:a}/{name}.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${renovated_code_dir}/data_preprocessing/covariate/covariate_formatting.R \
        --step merge_genotype_pc \
        --cwd "${cwd}" \
        --pcaFile "${pcaFile}" \
        --covFile "${covFile}" \
        --name "${name}" \
        --k ${k} \
        --outliersFile "${outliersFile}" \
        ${"--remove-outliers" if remove_outliers else ""} \
        --tol-cov ${tol_cov} \
        ${"--mean-impute" if mean_impute else ""} \
        --numThreads ${numThreads}
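The output file name of this step comes from the default `name = f'{covFile:bn}.{pcaFile:bn}'` and the `output:` statement. The sketch below reproduces that naming with `pathlib`, assuming that the SoS `:bn` format keeps the base name and drops only the final extension, which is consistent with the output file reported in the minimal working example. `merged_output_path` is a hypothetical helper written for this illustration.

```python
from pathlib import Path


def merged_output_path(cwd, cov_file, pca_file):
    """Build <cwd>/<covFile base name>.<pcaFile base name>.gz.

    Path.stem drops only the last extension, which is how the default
    `name` parameter is assumed to behave in this sketch.
    """
    name = f"{Path(cov_file).stem}.{Path(pca_file).stem}"
    return Path(cwd) / f"{name}.gz"


out = merged_output_path(
    "output/covariate",
    "data/covariate/covariates.tsv",
    "output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.rds",
)
print(out.name)
```

Passing `--name` on the command line overrides this default, so the derivation applies only when `--name` is left unset.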