Phenotype Data Formatting

Phenotype Data Formatting#

FIXME: this entire pipeline needs to be improved

Description#

We include a collection of workflows to format molecular phenotype data. These include workflows to separate phenotypes by chromosome, by user-provided regions, a workflow to subset bam files and a workflow to extract samples from phenotype files.

Input Files#

File	Description
`output/rnaseq/protocol_example.rnaseq.bed.bed.gz`	Residualized molecular phenotype data as a bgzipped, tabix-indexed BED (`#chr`/`start`/`end`/`ID` + one column per sample). Output of upstream `covariate_preprocessing` / `gene_annotation`.
`output/rnaseq/protocol_example.rnaseq.bed.bed.gz.tbi`	Tabix index for the phenotype BED, required to extract records per chromosome.

Output Files#

File	Description
`output/phenotype_uf/protocol_example.<chrom>.bed.gz` (+`.tbi`)	Per-chromosome phenotype BED (bed+index) annotated with genomic coordinates, suitable for TensorQTL analysis.
`output/phenotype_uf/protocol_example.phenotype_by_chrom_files.txt`	Two-column list mapping each chromosome to its per-chromosome phenotype BED.
`output/phenotype_uf/protocol_example.phenotype_by_chrom_files.region_list.txt`	Region list (`#chr`/`start`/`end`/`ID`/`path`) for downstream cis-QTL / fine-mapping.

Minimal Working Example Steps#

The data and singularity image used are available on Synapse.

Step 1. Partition phenotype by chromosome#

This is necessary for cis TensorQTL analysis.

Timing < 1 min

sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
    --cwd output/phenotype_uf \
    --phenoFile output/rnaseq/protocol_example.rnaseq.bed.bed.gz \
    --name protocol_example \
    --chrom chr22

This runs the two phenotype_by_chrom substeps in sequence: substep _1 extracts the records for each requested chromosome from the tabix-indexed BED and writes a per-chromosome bed.gz (+ index); substep _2 collects those into the *_files.txt and *.region_list.txt lists. For this minimal example only chr22 is present in the toy phenotype data.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

!sos run phenotype_formatting.ipynb -h

Setup and global parameters#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
import os
# Work directory & output directory
parameter: cwd = path("output")
# The filename namefor output data
parameter: container = ''
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Path to the input molecular phenotype data.
parameter: phenoFile = path
# name for the analysis output
parameter: name = str(phenoFile).removesuffix('.gz').removesuffix('.bed')

Process of molecular phenotype file#

This workflow produce a bed+tabix file for all the molecular pheno data that are included in the region list to feed into downstream analysis

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[phenotype_by_chrom_1]
# list of chroms to extract 
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd}/{name}.{_chrom}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_1 \
        --cwd "${cwd}" \
        --phenoFile "${_input}" \
        --name "${name}" \
        --chrom "${_chrom}" \
        --output "${_output}" \
        --numThreads ${numThreads}

[phenotype_by_chrom_2]
# Path to the input molecular phenotype data.
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt',f'{cwd}/{name}.{step_name[:-2]}_files.region_list.txt'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_2 \
        --cwd "${cwd}" \
        --phenoFile "${phenoFile}" \
        --inputs ${_input} \
        --output-files "${_output[0]}" \
        --output-region-list "${_output[1]}"

[phenotype_annotate_by_tad]
parameter: TAD_list = path
parameter: phenotype_per_tad = 2 # This is the minimum number of epigenomics marker for a tadb to be considered having a functions.
input: phenoFile,TAD_list
output: f'{cwd}/{_input[0]:b}.{_input[1]:b}.{phenotype_per_tad}_pheno_per_region.region_list'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.R \
        --step phenotype_annotate_by_tad \
        --cwd "${cwd}" \
        --phenoFile "${_input[0]}" \
        --TAD_list "${_input[1]}" \
        --phenotype-per-tad ${phenotype_per_tad} \
        --output "${_output}" \
        --numThreads ${numThreads}

[phenotype_by_chrom_gct_1]
# list of chroms to extract 
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd:a}/{name}.{_chrom}.gct'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_gct_1 \
        --cwd "${cwd}" \
        --phenoFile "${_input}" \
        --name "${name}" \
        --chrom "${_chrom}" \
        --output "${_output}" \
        --numThreads ${numThreads}

[phenotype_by_region_1]
# An index text file with 4 columns specifying the chr, start, end and name of regions to analyze
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chormosome that have regions to be analyzed.
def extract_chrom(lst):
    return list(set([item[0] for item in lst]))
chrom = extract_chrom(regions)
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "regions"
output: f'{cwd}/{region_list:bn}_phenotype_by_region/{name}.{_regions[3]}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'  
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_region_1 \
        --cwd "${cwd}" \
        --phenoFile "${_input}" \
        --name "${name}" \
        --region "${_regions[0]}" "${_regions[1]}" "${_regions[2]}" "${_regions[3]}" \
        --output "${_output}" \
        --numThreads ${numThreads}

[phenotype_by_region_2]
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_region_2 \
        --cwd "${cwd}" \
        --inputs ${_input} \
        --output "${_output}"

[bam_subsetting]
# Input to `samtools view` coordinates, for example, --region chr21 chr22
parameter: region = list
# Path to the input molecular phenotype data.
parameter: phenoFile = paths
input: phenoFile , group_by = 1
output: f'{cwd}/{_input:bn}.subsetted.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
    samtools view -b ${_input} ${region} > ${_output}

# Extract samples from expression data generated by RNASeQC
[gct_extract_samples]
parameter: keep_samples = path
input: phenoFile
output: f'{_input[0]:nn}.sample_matched.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/phenotype/phenotype_formatting.R \
        --step gct_extract_samples \
        --cwd "${cwd}" \
        --phenoFile "${_input[0]}" \
        --keep-samples "${keep_samples}" \
        --output "${_output}" \
        --numThreads ${numThreads}