Phenotype Data Formatting#
FIXME: this entire pipeline needs to be improved
Description#
We include a collection of workflows to format molecular phenotype data. These include workflows to separate phenotypes by chromosome, by user-provided regions, a workflow to subset bam files and a workflow to extract samples from phenotype files.
Input#
The input for this workflow is the collection of data for 1 molecular phenotype as described in the format of:
a complete residualized (covariates regressed out) molecular phenotype data
a region list
These input are outputs from previous pipelines such as covariate_preprocessing and gene_annotation.
Output#
A list of phenotype file (bed+index) for each chrom, annotated with genomic coordiates, suitable for TensorQTL analysis.
A lists of phenotype file (bed+index) for each gene, annotated with genomic coordiates, suitable for fine-mapping.
Minimal Working Example Steps#
The data and singularity image used are available on Synapse.
iii. Partition by chromosome#
This is necessary for cis TensorQTL analysis.
Timing < 1 min
sos run pipeline/phenotype_formatting.ipynb phenotype_by_chrom \
--cwd output/phenotype/phenotype_by_chrom \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz \
--name bulk_rnaseq \
--chrom `for i in {1..22}; do echo chr$i; done`
INFO: Running phenotype_by_chrom_1:
INFO: phenotype_by_chrom_1 (index=0) is completed.
INFO: phenotype_by_chrom_1 (index=1) is completed.
INFO: phenotype_by_chrom_1 (index=2) is completed.
INFO: phenotype_by_chrom_1 (index=3) is completed.
INFO: phenotype_by_chrom_1 (index=4) is completed.
INFO: phenotype_by_chrom_1 (index=5) is completed.
INFO: phenotype_by_chrom_1 (index=6) is completed.
INFO: phenotype_by_chrom_1 (index=7) is completed.
INFO: phenotype_by_chrom_1 (index=8) is completed.
INFO: phenotype_by_chrom_1 (index=9) is completed.
INFO: phenotype_by_chrom_1 (index=10) is completed.
INFO: phenotype_by_chrom_1 (index=11) is completed.
INFO: phenotype_by_chrom_1 (index=12) is completed.
INFO: phenotype_by_chrom_1 (index=13) is completed.
INFO: phenotype_by_chrom_1 (index=14) is completed.
INFO: phenotype_by_chrom_1 (index=15) is completed.
INFO: phenotype_by_chrom_1 (index=16) is completed.
INFO: phenotype_by_chrom_1 (index=17) is completed.
INFO: phenotype_by_chrom_1 (index=18) is completed.
INFO: phenotype_by_chrom_1 (index=19) is completed.
INFO: phenotype_by_chrom_1 (index=20) is completed.
INFO: phenotype_by_chrom_1 (index=21) is completed.
INFO: phenotype_by_chrom_1 output: output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr20.bed.gz output/phenotype/phenotype_by_chrom/bulk_rnaseq.chr10.bed.gz... (22 items in 22 groups)
INFO: Running phenotype_by_chrom_2:
INFO: phenotype_by_chrom_2 is completed.
INFO: phenotype_by_chrom_2 output: output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.txt output/phenotype/phenotype_by_chrom/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt
INFO: Workflow phenotype_by_chrom (ID=w63bafed6f59e12a5) is executed successfully with 2 completed steps and 23 completed substeps.
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
|---|---|---|---|---|
Command Interface#
!sos run phenotype_formatting.ipynb -h
usage: sos run phenotype_formatting.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
phenotype_by_chrom
phenotype_by_region
bam_subsetting
gct_extract_samples
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--container ''
The filename namefor output data
--entrypoint ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 20 (as int)
Number of threads
--phenoFile VAL (as path, required)
Path to the input molecular phenotype data.
--name f'{phenoFile:bn}'
name for the analysis output
Sections
phenotype_by_chrom_1:
Workflow Options:
--chrom VAL VAL ... (as type, required)
list of chroms to extract
phenotype_by_chrom_2:
phenotype_by_region_1:
Workflow Options:
--region-list VAL (as path, required)
An index text file with 4 columns specifying the chr,
start, end and name of regions to analyze
phenotype_by_region_2:
bam_subsetting:
Workflow Options:
--region VAL VAL ... (as type, required)
Input to `samtools view` coordinates, for example,
--region chr21 chr22
gct_extract_samples: Extract samples from expression data generated by
RNASeQC
Workflow Options:
--keep-samples VAL (as path, required)
Setup and global parameters#
[global]
parameter: renovated_code_dir = path('renovated_code/script') # override with --renovated-code-dir
import os
# Work directory & output directory
parameter: cwd = path("output")
# The filename namefor output data
parameter: container = ''
import re
parameter: entrypoint= ""
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
# Path to the input molecular phenotype data.
parameter: phenoFile = path
# name for the analysis output
parameter: name = str(phenoFile).removesuffix('.gz').removesuffix('.bed')
Process of molecular phenotype file#
This workflow produce a bed+tabix file for all the molecular pheno data that are included in the region list to feed into downstream analysis
[phenotype_by_chrom_1]
# list of chroms to extract
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd}/{name}.{_chrom}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_1 \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--name "${name}" \
--chrom "${_chrom}" \
--output "${_output}" \
--numThreads ${numThreads}
[phenotype_by_chrom_2]
# Path to the input molecular phenotype data.
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt',f'{cwd}/{name}.{step_name[:-2]}_files.region_list.txt'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_2 \
--cwd "${cwd}" \
--phenoFile "${phenoFile}" \
--inputs ${_input} \
--output-files "${_output[0]}" \
--output-region-list "${_output[1]}"
[phenotype_annotate_by_tad]
parameter: TAD_list = path
parameter: phenotype_per_tad = 2 # This is the minimum number of epigenomics marker for a tadb to be considered having a functions.
input: phenoFile,TAD_list
output: f'{cwd}/{_input[0]:b}.{_input[1]:b}.{phenotype_per_tad}_pheno_per_region.region_list'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.R \
--step phenotype_annotate_by_tad \
--cwd "${cwd}" \
--phenoFile "${_input[0]}" \
--TAD_list "${_input[1]}" \
--phenotype-per-tad ${phenotype_per_tad} \
--output "${_output}" \
--numThreads ${numThreads}
[phenotype_by_chrom_gct_1]
# list of chroms to extract
parameter: chrom = list
chrom = list(set(chrom))
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "chrom"
output: f'{cwd:a}/{name}.{_chrom}.gct'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_gct_1 \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--name "${name}" \
--chrom "${_chrom}" \
--output "${_output}" \
--numThreads ${numThreads}
[phenotype_by_chrom_gct_2]
# Path to the input molecular phenotype data.
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt',f'{cwd}/{name}.{step_name[:-2]}_files.region_list.txt'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_chrom_gct_2 \
--cwd "${cwd}" \
--phenoFile "${phenoFile}" \
--inputs ${_input} \
--output-files "${_output[0]}" \
--output-region-list "${_output[1]}"
[phenotype_by_region_1]
# An index text file with 4 columns specifying the chr, start, end and name of regions to analyze
parameter: region_list = path
regions = [x.strip().split() for x in open(region_list).readlines() if x.strip() and not x.strip().startswith('#')]
# Get the unique chormosome that have regions to be analyzed.
def extract_chrom(lst):
return list(set([item[0] for item in lst]))
chrom = extract_chrom(regions)
# Path to the input molecular phenotype data.
input: phenoFile, for_each = "regions"
output: f'{cwd}/{region_list:bn}_phenotype_by_region/{name}.{_regions[3]}.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_region_1 \
--cwd "${cwd}" \
--phenoFile "${_input}" \
--name "${name}" \
--region "${_regions[0]}" "${_regions[1]}" "${_regions[2]}" "${_regions[3]}" \
--output "${_output}" \
--numThreads ${numThreads}
[phenotype_by_region_2]
input: group_by = "all"
output: f'{cwd}/{name}.{step_name[:-2]}_files.txt'
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
bash ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.sh phenotype_by_region_2 \
--cwd "${cwd}" \
--inputs ${_input} \
--output "${_output}"
[bam_subsetting]
# Input to `samtools view` coordinates, for example, --region chr21 chr22
parameter: region = list
# Path to the input molecular phenotype data.
parameter: phenoFile = paths
input: phenoFile , group_by = 1
output: f'{cwd}/{_input:bn}.subsetted.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container, entrypoint=entrypoint
samtools view -b ${_input} ${region} > ${_output}
# Extract samples from expression data generated by RNASeQC
[gct_extract_samples]
parameter: keep_samples = path
input: phenoFile
output: f'{_input[0]:nn}.sample_matched.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
Rscript ${renovated_code_dir}/data_preprocessing/phenotype/phenotype_formatting.R \
--step gct_extract_samples \
--cwd "${cwd}" \
--phenoFile "${_input[0]}" \
--keep-samples "${keep_samples}" \
--output "${_output}" \
--numThreads ${numThreads}