Phenotype Preprocessing#

This notebook is a minimal working example (MWE) of the phenotype preprocessing section of the xQTL protocol. It chains the phenotype worker pipelines on the bundled toy dataset (60 samples, protocol_example.*) so the whole section runs end-to-end without editing.

Description#

The phenotype preprocessing section takes a molecular phenotype matrix through three stages: genomic-coordinate annotation, missing-value imputation, and partitioning by chromosome. The result is a BED-formatted phenotype ready for QTL association testing. The section chains three worker pipelines:

gene_annotation.ipynb — add genomic coordinate annotation and convert to .bed
phenotype_imputation.ipynb — impute missing values (gEBMF and other methods)
phenotype_formatting.ipynb — partition the phenotype by chromosome

Timing: < 12 min for the toy dataset on a single node.

Input Files#

File	Description
`input/rnaseq/protocol_example.rnaseq.bed.gz`	Gene-level molecular phenotype matrix (60 samples)
`input/proteomics/protocol_example.protein.missing.bed.gz`	Protein phenotype with missing values (for imputation)
`input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf`	GENCODE/Ensembl gene coordinate annotation

Output Files#

File	Description
`output/rnaseq/protocol_example.rnaseq.bed.bed.gz`	Coordinate-annotated phenotype in BED format
`output/phenotype_imputation_uf/.imputed.`	Imputed phenotype matrix
`output/phenotype_uf/*.chr22.bed.gz`	Per-chromosome phenotype, ready for TensorQTL

Steps#

Run the commands below in order from the toy example root. Each one invokes the corresponding worker pipeline through the pipeline/ symlinks.

Annotate phenotype with genomic coordinates#

Add gene-level genomic coordinates to the molecular phenotype matrix and convert it to BED format, using the gene annotation GTF.

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/rnaseq \
    --phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id
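
One quick sanity check on the annotated file is to count its sample columns. The sketch below assumes the TensorQTL-style BED layout, in which the first four columns hold chromosome, start, end, and phenotype ID and every remaining column is one sample. `count_bed_samples` is a hypothetical helper name, not part of the worker pipeline.

```shell
# Sketch: report the number of sample columns in a gzipped phenotype BED.
# Assumption: four leading annotation columns (chr, start, end, phenotype ID).
count_bed_samples() {
    gzip -dc "$1" | head -n 1 | awk -F'\t' '{ print NF - 4 }'
}

# Run the check only if the annotation step has already produced its output.
bed=output/rnaseq/protocol_example.rnaseq.bed.bed.gz
if [ -s "$bed" ]; then
    echo "sample columns: $(count_bed_samples "$bed")"
fi
```

If the layout assumption holds, the toy dataset would be expected to report 60 sample columns.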

Impute missing values#

Impute missing entries in the phenotype matrix. Here the gEBMF (greedy Empirical Bayes Matrix Factorization) method is used on the toy protein phenotype; other methods (EBMF, missForest, missXGBoost, knn, soft, mean, lod) are also available in the worker pipeline.

sos run pipeline/phenotype_imputation.ipynb gEBMF \
    --phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
    --cwd output/phenotype_imputation_uf \
    --num_factor 30
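
To see what imputation changes, it can help to count missing cells in the input. The following is a minimal sketch under stated assumptions: missing values are written as `NA` or left empty, and the first four columns are annotation rather than data. Neither assumption is confirmed by the protocol text, and `count_missing` is a hypothetical helper.

```shell
# Sketch: count missing cells in a gzipped phenotype BED.
# Assumptions: four leading annotation columns; missing cells are "NA" or empty.
count_missing() {
    gzip -dc "$1" | awk -F'\t' '
        NR > 1 { for (i = 5; i <= NF; i++) if ($i == "NA" || $i == "") n++ }
        END { print n + 0 }'
}

# Run the count only if the toy protein phenotype is present.
raw=input/proteomics/protocol_example.protein.missing.bed.gz
if [ -s "$raw" ]; then
    echo "missing cells before imputation: $(count_missing "$raw")"
fi
```

Running the same function on the imputed file written under output/phenotype_imputation_uf would be expected to print 0 if every missing cell was filled.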

Partition phenotype by chromosome#

Split the annotated phenotype into per-chromosome files (here chromosome 22), the format expected by the QTL association steps.

sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype/phenotype_by_chrom_for_cis \
    --phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
    --name bulk_rnaseq \
    --chrom chr22

Anticipated Results#

Each step writes its output files to its --cwd directory under output/. Verify success by checking that the expected output files exist and are non-empty. See the Output Files section above for the expected file names and formats.
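
The check described under Anticipated Results (output files exist and are non-empty) can be sketched as a small shell function. `check_nonempty` is a hypothetical helper, not part of the protocol. The example call covers only the one output path that is fully spelled out in the Output Files table; the other paths would need to be added once their exact file names are known.

```shell
# Sketch: report each path as ok or MISSING; return non-zero if any path
# is absent or empty.
check_nonempty() {
    _status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok       $f"
        else
            echo "MISSING  $f"
            _status=1
        fi
    done
    return "$_status"
}

# Example call on the annotated phenotype from the first step.
check_nonempty output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
    || echo "some outputs are missing or empty"
```

The `-s` test is true only for files that exist with a size greater than zero, so it covers both conditions at once.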