Covariate Data Preprocessing

Description#

This notebook runs the full covariate preprocessing pipeline end-to-end on the toy dataset. It chains the two covariate worker notebooks: it first merges the user-supplied covariates with the genotype principal components, then computes residuals on the merged covariates and infers hidden factors. The result is a single covariate matrix (known covariates + genotype PCs + hidden factors) ready for QTL association testing.

Timing: < 3 minutes on the toy data.

Input Files#

File	Description
`input/covariate/protocol_example.covariates.base.tsv`	Known sample covariates (e.g. sex, age, study)
`output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.rds`	Genotype PCA result (from genotype preprocessing)
`output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.scree.txt`	Scree values used to choose the number of PCs to keep
`output/rnaseq/protocol_example.rnaseq.bed.bed.gz`	Molecular phenotype matrix (from phenotype preprocessing)

Output Files#

File	Description
`output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz`	Merged covariates + genotype PCs (output of the merge step)
`output/covariate/...prune.pca.Marchenko_PC.gz`	Final covariate matrix with known covariates, genotype PCs, and inferred hidden factors, ready for TensorQTL

Steps#

Merge covariates and genotype PCs#

Merge the known covariates with the leading genotype PCs, keeping only PCs that explain a meaningful fraction of variance. The number of PCs to keep (`--k`) is read from the scree file: the `awk` filter selects the rows whose third column is below 0.8, and the value in the first column of the last such row is passed as `--k`. This step writes the merged covariate file used as input to the hidden-factor step.

sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate/ \
    --pcaFile output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.rds \
    --covFile input/covariate/protocol_example.covariates.base.tsv \
    --name protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca \
    --tol-cov 0.4 \
    --k `awk '$3 < 0.8' output/genotype/genotype_pca/protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1`
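To see what the backtick expression passed to `--k` evaluates to, the sketch below runs the same `awk | tail | cut` pipeline on a small made-up scree table. The three-column, tab-separated, header-less layout and the numbers are assumptions for illustration only; the real scree file may be laid out differently.

```shell
# Hypothetical scree table (assumed layout): index <TAB> value <TAB> value used by the filter
scree=$(mktemp)
printf '1\t0.40\t0.40\n2\t0.25\t0.65\n3\t0.10\t0.75\n4\t0.08\t0.83\n5\t0.05\t0.88\n' > "$scree"

# Same pipeline as in the command above:
#   awk  keeps rows whose third column is below 0.8 (rows 1 to 3 here)
#   tail keeps the last of those rows
#   cut  extracts the first tab-separated field of that row
k=$(awk '$3 < 0.8' "$scree" | tail -1 | cut -f 1)
echo "k=$k"   # prints k=3

rm -f "$scree"
```

Because `cut -f 1` splits on tabs by default, this only works as shown when the file is tab-separated. If no row passes the filter, `k` is empty and `--k` receives no value, so it can be worth echoing the result before launching the workflow.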

Compute residuals and infer hidden factors#

Using the merged covariate file and the molecular phenotype matrix, compute residuals on the merged covariates and infer hidden factors with the Marchenko-Pastur PCA method. The `--mean-impute-missing` flag fills any missing covariate values with the column mean. The output is the final covariate matrix (known covariates + genotype PCs + hidden factors).
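As general background on the Marchenko-Pastur criterion, the standard result can be stated as below. This is a sketch of the textbook bound, not a description of the workflow's exact implementation; the symbols $n$, $p$, $\gamma$, and $\sigma^2$ are notation introduced here and do not appear in the pipeline.

```latex
% X: n x p matrix with i.i.d. entries, mean 0 and variance \sigma^2, with p <= n
\lambda_{\pm} = \sigma^{2}\left(1 \pm \sqrt{\gamma}\right)^{2},
\qquad \gamma = \frac{p}{n}
% As n, p -> infinity with p/n -> \gamma, the eigenvalues of (1/n) X^T X
% concentrate in the interval [\lambda_-, \lambda_+].
```

Under this reading, a principal component of the residual matrix is treated as signal rather than noise when its eigenvalue exceeds $\lambda_{+}$, and the number of such components gives the number of hidden factors.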

sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
    --cwd output/covariate \
    --phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
    --covFile output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
    --mean-impute-missing

Anticipated Results#

Both steps write their results to output/covariate/, the directory passed as --cwd. Verify success by checking that the two files listed under Output Files exist and are non-empty. See the Output Files section above for the expected file names.
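The existence and non-empty check described under Anticipated Results can be sketched as a small shell helper. `check_outputs` is a hypothetical name, not part of the pipeline; the `gzip -t` integrity test assumes the `.gz` outputs are ordinary gzip files, and the hidden-factor output is matched with a glob on its documented `.Marchenko_PC.gz` suffix because its full name is abbreviated above.

```shell
# Hypothetical helper: report each file as OK, missing/empty, or not valid gzip.
check_outputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            # -s is true only when the file exists and has size > 0
            echo "MISSING or empty: $f"
            status=1
        elif ! gzip -t "$f" 2>/dev/null; then
            # gzip -t tests the compressed file's integrity without extracting it
            echo "CORRUPT gzip: $f"
            status=1
        else
            echo "OK: $f"
        fi
    done
    return "$status"
}

# Run from the protocol root after both steps have finished.
check_outputs \
    output/covariate/protocol_example.covariates.protocol_example.genotype.merged.plink_qc.plink_qc.prune.pca.gz \
    output/covariate/*.Marchenko_PC.gz \
    || echo "Some outputs are missing or unreadable; re-run the corresponding step."
```

If the glob matches nothing, the shell passes the pattern through literally and the helper reports it as missing, which is the desired behavior here.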