Phenotype data preprocessing#

This mini-protocol documents the shared post-processing steps and utilities for handling molecular phenotype files, including imputation of missing values.

Miniprotocol Timing#

This represents the total duration for all miniprotocol phases. While module-specific timings are provided separately on their respective pages, they are also included in this overall estimate.

Timing < 12 minutes

Overview#

This workflow applies the phenotype-related workflows from the xQTL project pipeline.

  1. gene_annotation.ipynb (step i): Adds genomic coordinate annotation to gene-level molecular phenotype files and converts them to .bed format

  2. phenotype_imputation.ipynb (step ii): Imputes missing entries in molecular phenotype data

  3. phenotype_formatting.ipynb (step iii): Splits each phenotype file by chromosome

Steps#

i. Phenotype Annotation#

This step annotates each gene in the original phenotype matrix with its corresponding chr, start, end, and gene_id.

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/rnaseq \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id
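
To make the idea of coordinate annotation concrete, the following is a minimal sketch on toy data, not the notebook's implementation. It assumes a tab-separated phenotype matrix keyed by gene_id and a hypothetical four-column coordinate table (chr, start, end, gene_id) already extracted from a GTF. The actual workflow reads the GTF directly and may define start and end differently.

```shell
# Toy stand-ins for the real inputs (hypothetical file names and values).
tmpdir=$(mktemp -d)

# Phenotype matrix: one row per gene, one column per sample.
printf 'gene_id\tS1\tS2\nENSG_A\t1.5\t2.0\nENSG_B\t0.3\t0.7\n' > "$tmpdir/pheno.tsv"

# Gene coordinates as they might be extracted from a GTF: chr, start, end, gene_id.
printf 'chr1\t100\t200\tENSG_A\nchr2\t500\t900\tENSG_B\n' > "$tmpdir/coords.tsv"

# First pass (NR == FNR) loads coordinates keyed by gene_id;
# second pass prefixes each phenotype row with its coordinates.
annotated=$(awk -F'\t' -v OFS='\t' '
    NR == FNR   { coord[$4] = $1 OFS $2 OFS $3; next }
    FNR == 1    { print "#chr", "start", "end", $0; next }
    $1 in coord { print coord[$1], $0 }
' "$tmpdir/coords.tsv" "$tmpdir/pheno.tsv")

printf '%s\n' "$annotated"
rm -r "$tmpdir"
```

Genes without a matching coordinate entry are dropped in this sketch; whether the pipeline drops or reports them is not stated here.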

ii. Missing Value Imputation#

This step imputes missing entries in the molecular phenotype data. It is optional for eQTL analysis but necessary for other QTL analyses. Missing entries are imputed with flashier, an empirical Bayes matrix factorization model.

sos run pipeline/phenotype_imputation.ipynb gEBMF \
    --phenoFile data/protocol_example.protein.bed.gz \
    --cwd output/phenotype/impute_gebmf \
    --no-qc-prior-to-impute 
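
Before imputing, it can help to know how many entries are actually missing. The sketch below counts them in a toy file. It assumes missing values are encoded as the string NA and that sample columns start at column 5, and it is not part of the pipeline.

```shell
tmpdir=$(mktemp -d)

# Toy BED-like phenotype file (hypothetical values): the first four columns are
# #chr, start, end, and the feature ID, followed by one column per sample.
printf '#chr\tstart\tend\tID\tS1\tS2\tS3\nchr1\t100\t200\tP1\t0.5\tNA\t1.2\nchr1\t300\t400\tP2\tNA\t0.9\t0.4\n' \
    | gzip > "$tmpdir/toy.bed.gz"

# Count NA entries across the sample columns (column 5 onward), skipping the header.
n_missing=$(gzip -dc "$tmpdir/toy.bed.gz" | awk -F'\t' '
    NR > 1 { for (i = 5; i <= NF; i++) if ($i == "NA") n++ }
    END    { print n + 0 }
')

echo "missing entries: $n_missing"
rm -r "$tmpdir"
```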

iii. Partition by Chromosome#

This step splits the phenotype file into one file per chromosome.

sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype/phenotype_by_chrom \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
    --name bulk_rnaseq \
    --chrom `for i in {1..22}; do echo chr$i; done`
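
The backtick expression passed to --chrom expands to the names chr1 through chr22, one argument each. The sketch below builds the same list with a portable loop (the original relies on bash brace expansion) and uses it to split a toy file by chromosome. The file names and the awk-based split are illustrative assumptions, not the notebook's implementation.

```shell
tmpdir=$(mktemp -d)

# Toy phenotype file with rows on two chromosomes (hypothetical values).
printf '#chr\tstart\tend\tID\tS1\nchr1\t100\t200\tG1\t0.5\nchr2\t300\t400\tG2\t0.9\nchr1\t500\t600\tG3\t0.1\n' \
    > "$tmpdir/toy.bed"

# Build the chromosome list chr1 ... chr22 without brace expansion.
chroms=""
i=1
while [ "$i" -le 22 ]; do
    chroms="$chroms chr$i"
    i=$((i + 1))
done

# Write one file per requested chromosome, repeating the header in each.
# Chromosomes with no rows end up as header-only files in this sketch.
for chrom in $chroms; do
    awk -F'\t' -v c="$chrom" 'NR == 1 || $1 == c' "$tmpdir/toy.bed" > "$tmpdir/toy.$chrom.bed"
done

# Data rows per file (line count minus the header line).
n_chr1=$(( $(wc -l < "$tmpdir/toy.chr1.bed") - 1 ))
n_chr2=$(( $(wc -l < "$tmpdir/toy.chr2.bed") - 1 ))
echo "chr1 rows: $n_chr1, chr2 rows: $n_chr2"
rm -r "$tmpdir"
```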

Anticipated Results#

Phenotype preprocessing should result in a phenotype file formatted and ready for use in TensorQTL.
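
A quick format check on a per-chromosome output can catch problems before the file is passed downstream. The sketch below rests on two assumptions: that the header begins with #chr, start, end, and that rows should be ordered by start position. The toy file name and contents are hypothetical.

```shell
tmpdir=$(mktemp -d)

# Toy per-chromosome phenotype file (hypothetical values).
printf '#chr\tstart\tend\tID\tS1\nchr1\t100\t200\tG1\t0.5\nchr1\t500\t600\tG3\t0.1\n' \
    | gzip > "$tmpdir/toy.chr1.bed.gz"

# Report "ok" only if the header starts with #chr, start, end and the start
# positions never decrease from one row to the next.
status=$(gzip -dc "$tmpdir/toy.chr1.bed.gz" | awk -F'\t' '
    NR == 1 { if ($1 != "#chr" || $2 != "start" || $3 != "end") bad = 1; next }
    NR > 2 && $2 + 0 < prev { bad = 1 }
    { prev = $2 + 0 }
    END { print (bad ? "needs attention" : "ok") }
')

echo "format check: $status"
rm -r "$tmpdir"
```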