Phenotype Preprocessing#
This notebook is a minimal working example (MWE) of the phenotype preprocessing section of the xQTL protocol. It chains the phenotype worker pipelines on the bundled toy dataset (60 samples, protocol_example.*) so the whole section runs end-to-end without editing.
Description#
The phenotype preprocessing section takes a molecular phenotype matrix through genomic-coordinate annotation, missing-value imputation, and partition by chromosome, producing a BED-formatted phenotype ready for QTL association testing. It chains three worker pipelines:
- gene_annotation.ipynb: add genomic coordinate annotation and convert to .bed
- phenotype_imputation.ipynb: impute missing values (gEBMF and other methods)
- phenotype_formatting.ipynb: partition the phenotype by chromosome
Timing: < 12 min for the toy dataset on a single node.
Input Files#
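| File | Description |
|---|---|
| input/rnaseq/protocol_example.rnaseq.bed.gz | Gene-level molecular phenotype matrix (60 samples) |
| input/proteomics/protocol_example.protein.missing.bed.gz | Protein phenotype with missing values (for imputation) |
| input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf | GENCODE/Ensembl gene coordinate annotation |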
Output Files#
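| File | Description |
|---|---|
| output/rnaseq/protocol_example.rnaseq.bed.bed.gz | Coordinate-annotated phenotype in BED format |
| Files under output/phenotype_imputation_uf/ | Imputed phenotype matrix |
| Files under output/phenotype/phenotype_by_chrom_for_cis/ | Per-chromosome phenotype, ready for TensorQTL |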
Steps#
Run the commands below in order from the toy example root. Each one invokes the corresponding worker pipeline through the pipeline/ symlinks.
Annotate phenotype with genomic coordinates#
Add gene-level genomic coordinates to the molecular phenotype matrix and convert it to BED format, using the gene annotation GTF.
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/rnaseq \
--phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id
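To make the shape of the annotated output concrete, the following is a minimal sketch of a coordinate join on made-up data. It is not the implementation used by gene_annotation.ipynb: the two-gene matrix, the tab-separated coordinate table standing in for the GTF, and the `annotate` helper are all hypothetical, and the coordinates are invented.

```shell
set -eu
tmp=$(mktemp -d)

# Toy phenotype matrix: one row per gene, one column per sample (made-up values).
printf 'gene_id\tS1\tS2\nENSG_B\t0.5\t1.2\nENSG_A\t2.0\t0.1\n' > "$tmp/pheno.tsv"

# Toy coordinate table standing in for the GTF: gene_id, chromosome, start, end.
printf 'ENSG_A\tchr22\t100\t101\nENSG_B\tchr22\t500\t501\n' > "$tmp/coord.tsv"

# Hypothetical helper: join coordinates onto phenotype rows by gene_id.
annotate() {
  awk 'BEGIN { FS = OFS = "\t" }
       NR == FNR { coord[$1] = $2 OFS $3 OFS $4; next }   # first file: coordinates
       FNR == 1  { $1 = "#chr" OFS "start" OFS "end" OFS $1; print; next }   # header row
       ($1 in coord) { $1 = coord[$1] OFS $1; print }     # genes without coordinates are dropped
      ' "$1" "$2"
}

annotate "$tmp/coord.tsv" "$tmp/pheno.tsv" > "$tmp/unsorted.bed"

# Keep the header first, then sort data rows by chromosome and start position.
{ head -n 1 "$tmp/unsorted.bed"; tail -n +2 "$tmp/unsorted.bed" | sort -k1,1 -k2,2n; } > "$tmp/pheno.bed"
cat "$tmp/pheno.bed"
```

The result has four leading columns (chromosome, start, end, feature ID) followed by one column per sample, which is the general layout of a BED-formatted phenotype.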
Impute missing values#
Impute missing entries in the phenotype matrix. Here the gEBMF (greedy Empirical Bayes Matrix Factorization) method is used on the toy protein phenotype; other methods (EBMF, missForest, missXGBoost, knn, soft, mean, lod) are also available in the worker pipeline.
sos run pipeline/phenotype_imputation.ipynb gEBMF \
--phenoFile input/proteomics/protocol_example.protein.missing.bed.gz \
--cwd output/phenotype_imputation_uf \
--num_factor 30
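For intuition about what imputation does to the matrix, here is a minimal sketch of the simplest idea, filling each missing entry with the mean of the observed values in the same row. This is an illustration on made-up data only: it is not gEBMF, it is not claimed to match the worker pipeline's own mean method, and the assumption that missing entries are written as NA is mine.

```shell
set -eu
tmp=$(mktemp -d)

# Toy BED-like phenotype with missing entries written as NA (made-up values).
printf '#chr\tstart\tend\tID\tS1\tS2\tS3\n'   > "$tmp/missing.bed"
printf 'chr22\t100\t101\tP1\t1.0\tNA\t3.0\n' >> "$tmp/missing.bed"
printf 'chr22\t500\t501\tP2\tNA\tNA\t4.0\n'  >> "$tmp/missing.bed"

# Replace each NA with the mean of the observed values in the same row.
awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print; next }            # keep the header unchanged
     {
       sum = 0; n = 0
       for (i = 5; i <= NF; i++) if ($i != "NA") { sum += $i; n++ }
       for (i = 5; i <= NF; i++) if ($i == "NA" && n > 0) $i = sum / n
       print
     }' "$tmp/missing.bed" > "$tmp/imputed.bed"
cat "$tmp/imputed.bed"
```

Rows with no observed value at all are left untouched by this sketch. Factorization-based methods such as gEBMF instead borrow information across features and samples.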
Partition phenotype by chromosome#
Split the annotated phenotype into per-chromosome files (here chromosome 22), the format expected by the QTL association steps.
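Conceptually, the partition keeps the header plus the rows whose first column matches the requested chromosome. A minimal sketch on made-up data follows; the file names and the `chrom` variable are hypothetical, and the worker pipeline may handle compression, indexing, and file lists differently.

```shell
set -eu
tmp=$(mktemp -d)

# Toy annotated phenotype spanning two chromosomes (made-up values).
printf '#chr\tstart\tend\tID\tS1\n'  > "$tmp/pheno.bed"
printf 'chr21\t10\t11\tG1\t0.3\n'   >> "$tmp/pheno.bed"
printf 'chr22\t20\t21\tG2\t0.7\n'   >> "$tmp/pheno.bed"
printf 'chr22\t30\t31\tG3\t0.9\n'   >> "$tmp/pheno.bed"

# Keep the header line and every row on the requested chromosome.
chrom=chr22
awk -v chrom="$chrom" 'BEGIN { FS = "\t" } NR == 1 || $1 == chrom' \
  "$tmp/pheno.bed" > "$tmp/pheno.$chrom.bed"
cat "$tmp/pheno.$chrom.bed"
```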
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
--cwd output/phenotype/phenotype_by_chrom_for_cis \
--phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
--name bulk_rnaseq \
--chrom chr22
Anticipated Results#
Each step writes its output files to its own subdirectory of output/ (the --cwd given in the commands above). Verify success by checking that the output files exist and are non-empty. See the Output Files section above for the expected files and formats.
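The existence and non-empty check described under Anticipated Results can be sketched as a small helper. The `check_outputs` function is hypothetical and not part of the protocol; it is demonstrated on a throwaway directory so the block runs anywhere.

```shell
set -eu

# Hypothetical helper: succeed only if the directory exists,
# holds at least one file, and none of its files is empty.
check_outputs() {
  dir=$1
  [ -d "$dir" ] || { echo "missing directory: $dir"; return 1; }
  total=$(find "$dir" -type f | wc -l | tr -d ' ')
  empty=$(find "$dir" -type f -size 0 | wc -l | tr -d ' ')
  echo "$dir: $total file(s), $empty empty"
  [ "$total" -gt 0 ] && [ "$empty" -eq 0 ]
}

# Demonstration on a temporary directory with one non-empty file.
demo=$(mktemp -d)
printf 'chr22\t20\t21\tG2\t0.7\n' > "$demo/ok.bed"
check_outputs "$demo"

# From the toy example root, the same check could be pointed at the
# --cwd directories used in the commands of this section:
# for d in output/rnaseq output/phenotype_imputation_uf \
#          output/phenotype/phenotype_by_chrom_for_cis; do
#   check_outputs "$d" || echo "inspect $d"
# done
```

A passing check only shows that files were written; it says nothing about whether their contents are correct.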