Covariate Data Formatting

Covariate Data Formatting#

Description#

Our covariate preprocessing steps merge genotypic principal components and fixed covariate files into one file for downstream QTL analysis.

Input Files#

File	Description
`protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.rds`	Genotype PCA result (RDS) produced by the genotype PCA module; contains the `pc_scores` matrix of per-sample principal components.
`protocol_example.covariates.tsv`	Fixed/known covariates (e.g. sex, age, PMI, study), samples in rows and covariates in columns with an `#id` header.

Output Files#

File	Description
`protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz`	Combined covariate matrix: the fixed covariates stacked with the selected genotype PCs, samples as columns and an `#id` header, ready for downstream QTL analysis.

Minimal Working Example Steps#

The data and singularity used in this minimal working example can be found on Synapse.

Merge Covariates and Genotype PCs#

The data and singularity used in this minimal working example can be found on Synapse.

Timing: <1 min

This step reads the genotype PCA result (--pcaFile) and the fixed covariate table (--covFile), keeps the samples present in both, and stacks the chosen genotype PCs on top of the covariates to produce a single matrix for QTL analysis. The number of PCs to keep is set with --k; here we pick the count whose cumulative variance explained stays under 80% by reading the PCA scree file. --tol-cov 0.4 allows samples with up to 40% missing covariate values to be mean-imputed rather than dropped.

sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate/ \
    --pcaFile output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.rds \
    --covFile input/covariate/protocol_example.covariates.base.tsv \
    --name protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca \
    --tol-cov 0.4 \
    --k `awk '$3 < 0.8' output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1`

Command Interface#

sos run covariate_formatting.ipynb -h

Setup and global parameters#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# The output directory for generated files. 
parameter: cwd = path("output")
# The covariate file
parameter: covFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "2G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
parameter: entrypoint=""
cwd = path(f"{cwd:a}")

Step 0. Merge Covariates and Genotype PCs#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[merge_genotype_pc]
# An RDS file as the output of the genotype PCA module
parameter: pcaFile = path
# The number of PCs to retain, by default is 20, in practice can be the number of PC that captured more than 70% PVE
parameter: k = 20
parameter: name = f'{covFile:bn}.{pcaFile:bn}'
# Outliers
parameter: outliersFile = path(".") 
parameter: remove_outliers = False
# Tolerance of missingness in covariates, -1 means do nothing, otherwise for samples with covariates missing rate larger than tol_cov will be removed,
# with missing rate smaller than tol_cov will be kept.
parameter: tol_cov = -1.0 
parameter: mean_impute = True
stop_if(remove_outliers and not outliersFile.is_file(), msg = "No outliers file specified, please add outliers file or remove the remove-outliers flag")
input: pcaFile, covFile
output:  f'{cwd:a}/{name}.gz'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/covariate/covariate_formatting.R \
        --step merge_genotype_pc \
        --cwd "${cwd}" \
        --pcaFile "${pcaFile}" \
        --covFile "${covFile}" \
        --name "${name}" \
        --k ${k} \
        --outliersFile "${outliersFile}" \
        ${"--remove-outliers" if remove_outliers else ""} \
        --tol-cov ${tol_cov} \
        ${"--mean-impute" if mean_impute else ""} \
        --numThreads ${numThreads}