Phenotype data preprocessing

This mini-protocol documents the shared post-processing steps and some utilities for handling molecular phenotype files, including imputation of missing values.

Miniprotocol Timing

The estimate below is the total duration of all phases of this miniprotocol. Module-specific timings are listed separately on each module's page and are already included in this total.

Timing < 12 minutes

Overview

This workflow applies the phenotype-related workflows from the xQTL project pipeline.

gene_annotation.ipynb (step i): Adds genomic coordinate annotation to gene-level molecular phenotype files and converts them to .bed format
phenotype_imputation.ipynb (step ii): Imputes missing entries in molecular phenotype data
phenotype_formatting.ipynb (step iii): Splits each phenotype file by chromosome

Steps

i. Phenotype Annotation

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/rnaseq \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id
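
To make the idea behind this step concrete, the following is a minimal Python sketch of coordinate annotation: look up each gene in a GTF-style annotation and prepend its coordinates as BED columns. It is not the code that `annotate_coord` runs. The function names `parse_gtf_genes` and `annotate`, the toy records, and the choice of the full gene span as start and end are assumptions made for illustration; the pipeline may use a different coordinate convention.

```python
import re

def parse_gtf_genes(gtf_lines):
    """Return {gene_id: (chrom, start, end)} from GTF 'gene' records."""
    coords = {}
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match:
            # GTF is 1-based inclusive; BED start is 0-based, so subtract 1.
            coords[match.group(1)] = (fields[0], int(fields[3]) - 1, int(fields[4]))
    return coords

def annotate(pheno_rows, coords):
    """Prepend chrom/start/end to (gene_id, values) rows; drop unknown genes."""
    out = []
    for gene_id, values in pheno_rows:
        if gene_id in coords:
            chrom, start, end = coords[gene_id]
            out.append((chrom, start, end, gene_id, *values))
    # Plain string sort on chromosome name, then numeric sort on start.
    return sorted(out, key=lambda row: (row[0], row[1]))

# Toy inputs (hypothetical genes and values, not from the protocol data)
gtf = [
    'chr1\tsrc\tgene\t1000\t2000\t.\t+\t.\tgene_id "G1"; gene_name "A";',
    'chr1\tsrc\texon\t1000\t1200\t.\t+\t.\tgene_id "G1";',
    'chr2\tsrc\tgene\t500\t900\t.\t-\t.\tgene_id "G2";',
]
pheno = [("G2", [0.5, 1.5]), ("G1", [2.0, 3.0]), ("G3", [9.9, 9.9])]
bed_rows = annotate(pheno, parse_gtf_genes(gtf))
```

In this toy example `G3` has no annotation record and is dropped, which mirrors why the phenotype ID column (here `gene_id`) must match the identifiers used in the annotation file.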

ii. Missing Value Imputation

sos run pipeline/phenotype_imputation.ipynb gEBMF \
    --phenoFile data/protocol_example.protein.bed.gz \
    --cwd output/phenotype/impute_gebmf \
    --no-qc-prior-to-impute 
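
For readers new to imputation, the sketch below shows what "filling missing entries" means on a small matrix. It deliberately does not implement gEBMF; it uses per-feature mean imputation, a much simpler stand-in, only to illustrate the input and output shape. The function name `impute_row_means` and the toy values are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or math.isnan(value)

def impute_row_means(matrix):
    """Replace missing entries in each row with the mean of that row's observed values."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if not is_missing(v)]
        if not observed:
            raise ValueError("row has no observed values to impute from")
        mean = sum(observed) / len(observed)
        imputed.append([mean if is_missing(v) else v for v in row])
    return imputed

# Toy matrix: rows are features (e.g. proteins), columns are samples
protein = [
    [1.0, None, 3.0],
    [4.0, 6.0, float("nan")],
]
filled = impute_row_means(protein)
```

Mean imputation ignores correlation between features and samples, which is the kind of structure a factorization-based method is designed to exploit.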

iii. Partition by Chromosome

sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype/phenotype_by_chrom \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
    --name bulk_rnaseq \
    --chrom $(for i in {1..22}; do echo chr$i; done)
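
The partitioning logic can be pictured with a short Python sketch: group the data lines of a BED-style phenotype file by their first column and repeat the header in each group. This is an illustration under stated assumptions, not the implementation of `phenotype_by_chrom`; the function name `split_by_chrom`, the header layout `#chr start end ID`, and the toy rows are hypothetical.

```python
from collections import defaultdict

def split_by_chrom(bed_lines, chroms):
    """Group BED data lines by chromosome, repeating the header in each group."""
    header = None
    groups = defaultdict(list)
    wanted = set(chroms)
    for line in bed_lines:
        if line.startswith("#"):
            header = line  # header line, e.g. "#chr\tstart\tend\tID\t..."
            continue
        chrom = line.split("\t", 1)[0]
        if chrom in wanted:
            groups[chrom].append(line)
    prefix = [header] if header else []
    # Keep the requested chromosome order; skip chromosomes with no rows.
    return {c: prefix + groups[c] for c in chroms if c in groups}

# Same chromosome list as the --chrom argument above: chr1 ... chr22
chroms = [f"chr{i}" for i in range(1, 23)]
bed = [
    "#chr\tstart\tend\tID\tS1\tS2",
    "chr1\t999\t1000\tG1\t2.0\t3.0",
    "chr2\t499\t500\tG2\t0.5\t1.5",
    "chr1\t4999\t5000\tG4\t1.1\t1.2",
    "chrX\t99\t100\tG5\t0.1\t0.2",
]
per_chrom = split_by_chrom(bed, chroms)
```

Because only `chr1` to `chr22` are requested, the `chrX` row in the toy input does not appear in any output group.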

Anticipated Results

Phenotype preprocessing should produce phenotype files, partitioned by chromosome, that are formatted and ready for use in TensorQTL.
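
Before passing a file downstream, a quick structural check can catch formatting problems early. The Python sketch below is one possible check and rests on assumptions: it expects a BED-style layout with a header starting with `#chr`, four annotation columns, and one column per sample. It is not TensorQTL's own validation, and the function name `check_phenotype_bed` and the toy lines are hypothetical.

```python
def check_phenotype_bed(lines):
    """Return a list of problems found in BED-style phenotype lines (empty if none)."""
    if not lines or not lines[0].lower().startswith("#chr\t"):
        return ["first line should be a header starting with '#chr'"]
    n_cols = len(lines[0].rstrip("\n").split("\t"))
    if n_cols < 5:
        return ["expected 4 annotation columns plus at least one sample column"]
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != n_cols:
            problems.append(f"line {number}: {len(fields)} columns, header has {n_cols}")
        elif not (fields[1].isdigit() and fields[2].isdigit()):
            problems.append(f"line {number}: start/end are not non-negative integers")
        elif int(fields[1]) >= int(fields[2]):
            problems.append(f"line {number}: start must be smaller than end")
    return problems

good = ["#chr\tstart\tend\tID\tS1", "chr1\t999\t1000\tG1\t2.0"]
bad = [
    "#chr\tstart\tend\tID\tS1",
    "chr1\t1000\t999\tG1\t2.0",  # start is larger than end
    "chr1\t10\t11\tG2",          # missing the sample column
]
good_report = check_phenotype_bed(good)
bad_report = check_phenotype_bed(bad)
```

An empty report only means the file passes these structural checks; it says nothing about whether sample identifiers match the genotype data.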