Covariate Data Preprocessing#

This notebook contains the workflow for processing covariate files and computing PCA-derived covariates (hidden factors) from phenotype data.

Miniprotocol Timing#

The timing below is the total duration of all phases of this miniprotocol. Module-specific timings are listed on their respective pages and are already included in this total.

Timing: < 3 minutes

Overview#

This workflow applies the covariate-related sections of the xQTL project pipeline:

  1. covariate_formatting.ipynb (step i): Merge covariates and genotype PCA

  2. covariate_hidden_factor.ipynb (step ii): Compute residuals on the merged covariates and perform hidden factor analysis

Steps#

i. Merge Covariates and Genotype PCs#

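The --k parameter sets the number of genotype PCs to merge into the covariate file. In this example, k is read from the PCA scree file: the awk filter keeps the PCs whose value in the third column (taken here to be the cumulative proportion of variance explained) is below 0.8, and the index of the last such PC is passed as --k. The retained PCs therefore explain just under 80% of the variance. Change the 0.8 threshold to retain more or fewer PCs.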

sos run pipeline/covariate_formatting.ipynb merge_genotype_pc \
    --cwd output/covariate/ \
    --pcaFile output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.rds \
    --covFile data/covariate/covariates.tsv \
    --tol-cov 0.4 \
    --k `awk '$3 < 0.8' output/genotype/genotype_pca/wgs.merged.plink_qc.plink_qc.prune.pca.scree.txt | tail -1 | cut -f 1 ` 
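The way the --k value is derived can be checked in isolation. The sketch below is a minimal illustration on a toy scree table. The three-column layout (PC index, proportion of variance, cumulative proportion), the absence of a header line, and the numbers themselves are assumptions made for the example, not the actual contents of the pipeline's scree file.

```shell
# Toy scree table (hypothetical layout and values):
# column 1 = PC index, column 2 = proportion of variance, column 3 = cumulative proportion
scree=$(mktemp)
printf '1\t0.40\t0.40\n2\t0.25\t0.65\n3\t0.10\t0.75\n4\t0.08\t0.83\n5\t0.05\t0.88\n' > "$scree"

# Keep the rows whose cumulative proportion is below 0.8,
# then take the PC index (first tab-separated field) of the last such row.
k=$(awk '$3 < 0.8' "$scree" | tail -1 | cut -f 1)
echo "k=$k"

rm -f "$scree"
```

With this selection rule, k is the last PC that stays below the threshold, so the retained PCs explain slightly less than 80% of the variance. If the first PC alone already reaches the threshold, no row passes the filter and k is empty, which is worth checking before passing the value to --k.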

ii. Compute Residual on Merged Covariates and Perform Hidden Factor Analysis#

sos run pipeline/covariate_hidden_factor.ipynb Marchenko_PC \
   --cwd output/covariate \
   --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz  \
   --covFile output/covariate/covariates.wgs.merged.plink_qc.plink_qc.prune.pca.gz \
   --mean-impute-missing 
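As a rough illustration of what mean imputation of missing values involves, the sketch below fills each missing entry of a toy feature-by-sample matrix with the mean of the observed values in the same row. This is an assumption-laden example rather than the pipeline's own code: the file layout, the NA marker, and the choice of a per-row (per-feature) mean are all hypothetical, and the behavior of --mean-impute-missing in covariate_hidden_factor.ipynb may differ.

```shell
# Toy feature-by-sample matrix (hypothetical): first column is the feature ID, NA marks a missing value
mat=$(mktemp)
printf 'gene\ts1\ts2\ts3\n'   >  "$mat"
printf 'g1\t1.0\tNA\t3.0\n'   >> "$mat"
printf 'g2\t2.0\t4.0\t6.0\n'  >> "$mat"

# Replace each NA with the mean of the observed values in the same row
imputed=$(awk 'BEGIN { FS = OFS = "\t" }
NR == 1 { print; next }   # pass the header through unchanged
{
    sum = 0; n = 0
    # first pass: accumulate the observed (non-NA) values of this row
    for (i = 2; i <= NF; i++) if ($i != "NA") { sum += $i; n++ }
    # second pass: fill the missing entries with the row mean
    for (i = 2; i <= NF; i++) if ($i == "NA" && n > 0) $i = sprintf("%.1f", sum / n)
    print
}' "$mat")
echo "$imputed"

rm -f "$mat"
```

Rows without missing values pass through unchanged, and a row with no observed values at all is left as-is in this sketch, since there is no mean to impute from.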

Anticipated Results#

The processed covariate data consist of a file that combines the merged covariates with the inferred hidden factors, ready for use in TensorQTL.