Phenotype data preprocessing#
This mini-protocol documents the shared post-processing steps and some utilities for handling molecular phenotype files, including imputation.
Miniprotocol Timing#
This is the total duration of all miniprotocol phases. Module-specific timings are listed separately on their respective pages and are already included in this overall estimate.
Timing < 12 minutes
Overview#
This workflow is an application of the phenotype-related workflows from the xQTL project pipeline.
gene_annotation.ipynb (step i): Adds genomic coordinate annotation to gene-level molecular phenotype files and converts them to .bed format
phenotype_imputation.ipynb (step ii): Imputes missing entries of molecular phenotype data
phenotype_formatting.ipynb (step iii): Splits each phenotype file by chromosome
Steps#
i. Phenotype Annotation#
This step annotates each gene in the original phenotype matrix with its corresponding chr, start, end, and gene_id.
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/rnaseq \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
--coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id
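As a quick sanity check after annotation, the sketch below prints the first four header fields of a gzipped BED file so they can be compared with the expected chr, start, end, and gene_id columns. The helper name `bed_header_fields` is hypothetical and not part of the pipeline; it assumes a tab-delimited file whose first line is the header.

```shell
# Hypothetical helper (not part of the xQTL pipeline): print the first
# four tab-separated fields of the header line of a gzipped BED file.
# Assumes the file is tab-delimited and its first line is a header.
bed_header_fields() {
    gzip -dc "$1" | head -n 1 | cut -f 1-4
}

# Example usage (placeholder path; substitute the annotated output of step i):
# bed_header_fields path/to/annotated.bed.gz
```

If the printed fields do not match the coordinate and ID columns described above, the annotation step is worth re-checking before moving on.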
ii. Missing Value Imputation#
This step imputes the missing entries in the molecular phenotype data. It is optional for eQTL analysis but necessary for other QTL analyses. The missing entries are imputed with flashier, an Empirical Bayes Matrix Factorization method.
sos run pipeline/phenotype_imputation.ipynb gEBMF \
--phenoFile data/protocol_example.protein.bed.gz \
--cwd output/phenotype/impute_gebmf \
--no-qc-prior-to-impute
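To gauge how much the imputation has to fill in, one option is to count the missing cells in the input file and again in the imputed output. The sketch below is a minimal example under two assumptions that the source does not state: missing values are encoded as the literal string `NA`, and the first four columns hold coordinates and the phenotype ID, so sample values start at column 5. The helper name `count_missing` is hypothetical.

```shell
# Hypothetical helper (not part of the xQTL pipeline): count cells equal
# to "NA" in the sample columns (column 5 onward) of a gzipped,
# tab-delimited BED file that has one header line.
# Assumption: missing values are encoded as the literal string NA.
count_missing() {
    gzip -dc "$1" | awk -F '\t' '
        NR > 1 { for (i = 5; i <= NF; i++) if ($i == "NA") n++ }
        END    { print n + 0 }'
}

# Example usage with the input file from the command above:
# count_missing data/protocol_example.protein.bed.gz
```

Under the same encoding assumption, running the helper on the imputed output would be expected to report 0 if every missing entry was filled.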
iii. Partition by Chromosome#
This step splits the phenotype file into separate files by chromosome.
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
--cwd output/phenotype/phenotype_by_chrom \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
--name bulk_rnaseq \
--chrom `for i in {1..22}; do echo chr$i; done`
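The backtick loop passed to `--chrom` in the command above expands to chr1 through chr22. To see which chromosomes are actually present in a phenotype file before partitioning, a small helper along the following lines may be useful. The name `list_chroms` is hypothetical, and the sketch assumes a gzipped, tab-delimited BED file with a single header line and the chromosome in column 1.

```shell
# Hypothetical helper (not part of the xQTL pipeline): list the distinct
# values in column 1 (chromosome) of a gzipped, tab-delimited BED file,
# skipping the header line. Output is sorted lexicographically.
list_chroms() {
    gzip -dc "$1" | awk -F '\t' 'NR > 1 { print $1 }' | sort -u
}

# Example usage (placeholder path; substitute your own phenotype file):
# list_chroms path/to/phenotype.bed.gz
```

Comparing this list with the values passed to `--chrom` shows whether any requested chromosome has no phenotypes in the file.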
Anticipated Results#
Phenotype preprocessing should result in a phenotype file formatted and ready for use in TensorQTL.