Phenotype data preprocessing
This mini-protocol documents the shared post-processing steps and some utilities for handling molecular phenotype files, including imputation of missing values.
Miniprotocol Timing
This is the total duration of all miniprotocol phases. Module-specific timings are given separately on their respective pages and are already included in this overall estimate.
Timing: < 12 minutes
Overview
This workflow applies the phenotype-related workflows from the xQTL project pipeline.
gene_annotation.ipynb (step i): Adds genomic coordinate annotation to gene-level molecular phenotype files and converts them to .bed format
phenotype_imputation.ipynb (step ii): Imputes missing entries of molecular phenotype data
phenotype_formatting.ipynb (step iii): Splits each phenotype file by chromosome
Steps
i. Phenotype Annotation
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/rnaseq \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
--coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id
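After annotation, each feature row should carry genomic coordinates ahead of the sample columns. The sketch below builds a small toy table and checks that its header starts with `#` and that the start and end columns are integers with start before end. The column layout (`#chr`, `start`, `end`, `ID`, then samples), the file name `toy.bed`, and the helper `check_bed` are assumptions made for illustration; they are not output or code taken from the pipeline.

```shell
# Toy annotated phenotype table (tab-separated); all values are made up.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.bed"
printf '#chr\tstart\tend\tID\tsample1\tsample2\n' > "$toy"
printf 'chr1\t11868\t11869\tgeneA\t0.52\t-1.10\n' >> "$toy"
printf 'chr2\t38813\t38814\tgeneB\t1.73\t0.08\n' >> "$toy"

# Hypothetical helper: prints OK if the header starts with '#', has at least
# five columns, and every data row has integer start < end; FAIL otherwise.
check_bed() {
  awk -F'\t' '
    NR == 1 { if ($0 !~ /^#/ || NF < 5) bad = 1; next }
    $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad = 1 }
    END { print (bad ? "FAIL" : "OK") }
  ' "$1"
}

bed_status=$(check_bed "$toy")
echo "$bed_status"
```

For a real, compressed output file, the same check could be applied to a decompressed copy of the annotated file.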
ii. Missing Value Imputation
sos run pipeline/phenotype_imputation.ipynb gEBMF \
--phenoFile data/protocol_example.protein.bed.gz \
--cwd output/phenotype/impute_gebmf \
--no-qc-prior-to-impute
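Before imputing, it can help to know how many entries are actually missing per feature. The sketch below counts them in a toy table. It assumes missing entries are written as `NA` and that sample values start at column 5; both assumptions, along with the helper name `count_missing`, are illustrative and should be checked against the real phenotype file.

```shell
# Toy protein table (tab-separated) with a few missing entries; values are made up.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy_protein.bed"
printf '#chr\tstart\tend\tID\ts1\ts2\ts3\n' > "$toy"
printf 'chr1\t100\t101\tprotA\t0.4\tNA\t1.2\n' >> "$toy"
printf 'chr1\t500\t501\tprotB\tNA\tNA\t0.7\n' >> "$toy"
printf 'chr2\t900\t901\tprotC\t0.1\t0.3\t0.9\n' >> "$toy"

# Hypothetical helper: for each feature row, print "ID<TAB>number of NA entries".
count_missing() {
  awk -F'\t' 'NR > 1 {
    n = 0
    for (i = 5; i <= NF; i++) if ($i == "NA") n++
    print $4 "\t" n
  }' "$1"
}

missing_report=$(count_missing "$toy")
printf '%s\n' "$missing_report"
```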
iii. Partition by Chromosome
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
--cwd output/phenotype/phenotype_by_chrom \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
--name bulk_rnaseq \
--chrom `for i in {1..22}; do echo chr$i; done`
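The backtick expression passed to `--chrom` is ordinary shell command substitution, so the workflow receives a plain list of chromosome names. The sketch below shows the same expansion on its own, written with `seq` and `$(...)` so that it also works in shells without brace expansion. The variable names are only for illustration.

```shell
# Build the autosome list chr1 ... chr22, one name per line.
chroms=$(for i in $(seq 1 22); do echo "chr$i"; done)

# Leaving $chroms unquoted lets the shell split it into separate words,
# which is how the names arrive as individual values after --chrom.
n_chrom=$(printf '%s\n' $chroms | wc -l | tr -d ' ')
first_chrom=$(printf '%s\n' $chroms | head -n 1)
last_chrom=$(printf '%s\n' $chroms | tail -n 1)

echo "$n_chrom chromosomes, from $first_chrom to $last_chrom"
```

Printing the list this way before launching the workflow is a cheap check that the chromosome names match the naming used in the first column of the phenotype file.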
Anticipated Results
Phenotype preprocessing should produce a phenotype file that is formatted and ready for use in TensorQTL.