Bulk RNA-seq counts normalization#
Description#
The normalization step follows steps used by the GTeX pipeline. Genes are first filtered to keep genes where TPM is greater than 10% in at least 20% of the samples. They are also kept if read counts is greater than 6 in at least 20% of the samples. The filtered data is then normalized using the Trimmed Mean of M-value (TMM) method.
Timing: ~1-2 min on typical compute infrastructure.
Input Files#
File |
Description |
|---|---|
|
TPM matrix in RNA-SeQC GCT format (output of |
|
Read-count matrix in GCT format (output of |
|
Collapsed gene-model GTF; gene names must match the GCT matrices and use the |
|
Tab-delimited, header, two columns mapping expression sample name to genotype sample name (must cover every sample in the matrices) |
Steps#
Filter and TMM-normalize the QC-ed count/TPM matrices into a phenotype bed.gz. The inputs here are the outputs of the bulk_expression_QC step; --count-threshold sets the minimum read count used in the GTeX-style gene filter.
sos run pipeline/bulk_expression_normalization.ipynb normalize \
--cwd output/rnaseq \
--tpm-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--count-threshold 1 \
--sample_participant_lookup input/rnaseq/protocol_example.rnaseq.sample_participant_lookup.txt
Output Files#
File |
Description |
|---|---|
|
TMM/CPM/voom-normalized expression matrix in BED format ( |
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
|---|---|---|---|---|
Command interface#
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
!sos run bulk_expression_normalization.ipynb -h
/home/caox2/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
usage: sos run bulk_expression_normalization.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
normalize
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--counts-gct VAL (as path, required)
gene count table
--tpm-gct VAL (as path, required)
gene TPM table
--annotation-gtf VAL (as path, required)
gene gtf annotation table
--sample-participant-lookup VAL (as path, required)
A file to map sample ID from expression to genotype,must
contain two columns, sample_id and participant_id,
mapping IDs in the expression files to IDs in the
genotype (these can be the same).
--tpm-threshold 0.1 (as float)
--count-threshold 6 (as int)
--sample-frac-threshold 0.2 (as float)
--normalization-method 'tmm_cpm_voom'
Normalization method: TMM + CPM(voom) (tmm_cpm_voom),
TMM + CPM(edgeR) (tmm_cpm_edger), or quantile
normalization (qn)
--[no-]quantile-normalize (default to True)
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 20 (as int)
Number of threads
--container ''
Sections
normalize:
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
# gene count table
parameter: counts_gct = path
# gene TPM table
parameter: tpm_gct = path
# gene gtf annotation table
parameter: annotation_gtf = path
# A file to map sample ID from expression to genotype,must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path
parameter: tpm_threshold = 0.1
parameter: count_threshold = 6
parameter: sample_frac_threshold = 0.2
# Normalization method: TMM + CPM(voom) (tmm_cpm_voom), TMM + CPM(edgeR) (tmm_cpm_edger), or quantile normalization (qn)
parameter: normalization_method = 'tmm_cpm_voom'
# Quantile Normalization after rescale
parameter: quantile_normalize = True
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
[normalize]
# Path to the input molecular phenotype data, should be a processd and indexed bed.gz file, with tabix index.
input: tpm_gct, counts_gct, annotation_gtf, sample_participant_lookup
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.{"no_qnorm." if not quantile_normalize else ""}expression.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_normalization.py \
--step normalize \
--cwd "${cwd}" \
--tpm-gct "${_input[0]}" \
--counts-gct "${_input[1]}" \
--annotation-gtf "${_input[2]}" \
--sample-participant-lookup "${_input[3]}" \
--tpm-threshold ${tpm_threshold} \
--count-threshold ${count_threshold} \
--sample-frac-threshold ${sample_frac_threshold} \
--normalization-method "${normalization_method}" \
${"--quantile-normalize" if quantile_normalize else ""} \
--numThreads ${numThreads}