Bulk RNA-seq counts normalization#

Description#

The normalization step follows steps used by the GTeX pipeline. Genes are first filtered to keep genes where TPM is greater than 10% in at least 20% of the samples. They are also kept if read counts is greater than 6 in at least 20% of the samples. The filtered data is then normalized using the Trimmed Mean of M-value (TMM) method.

Timing: ~1-2 min on typical compute infrastructure.

Input Files#

File

Description

output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz

TPM matrix in RNA-SeQC GCT format (output of bulk_expression_QC)

output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz

Read-count matrix in GCT format (output of bulk_expression_QC)

reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf

Collapsed gene-model GTF; gene names must match the GCT matrices and use the chr prefix

input/rnaseq/protocol_example.rnaseq.sample_participant_lookup.txt

Tab-delimited, header, two columns mapping expression sample name to genotype sample name (must cover every sample in the matrices)

Steps#

Filter and TMM-normalize the QC-ed count/TPM matrices into a phenotype bed.gz. The inputs here are the outputs of the bulk_expression_QC step; --count-threshold sets the minimum read count used in the GTeX-style gene filter.

sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --count-threshold 1 \
    --sample_participant_lookup input/rnaseq/protocol_example.rnaseq.sample_participant_lookup.txt 

Output Files#

File

Description

output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tmm_cpm_voom.expression.bed.gz

TMM/CPM/voom-normalized expression matrix in BED format (#chr, start, end, gene_id, then one column per sample); ready as a molecular phenotype for QTL/TWAS analysis

Troubleshooting#

Step

Substep

Problem

Possible Reason

Solution

Command interface#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

!sos run bulk_expression_normalization.ipynb -h
/home/caox2/.pixi/envs/python/lib/python3.12/site-packages/sos/targets.py:22: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
usage: sos run bulk_expression_normalization.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  normalize

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --counts-gct VAL (as path, required)
                        gene count table
  --tpm-gct VAL (as path, required)
                        gene TPM table
  --annotation-gtf VAL (as path, required)
                        gene gtf annotation table
  --sample-participant-lookup VAL (as path, required)
                        A file to map sample ID from expression to genotype,must
                        contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
  --tpm-threshold 0.1 (as float)
  --count-threshold 6 (as int)
  --sample-frac-threshold 0.2 (as float)
  --normalization-method 'tmm_cpm_voom'
                        Normalization method: TMM + CPM(voom) (tmm_cpm_voom),
                        TMM + CPM(edgeR) (tmm_cpm_edger), or quantile
                        normalization (qn)
  --[no-]quantile-normalize (default to True)
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 20 (as int)
                        Number of threads
  --container ''


Sections
  normalize:
[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
#  gene count table
parameter: counts_gct = path
#  gene TPM table
parameter: tpm_gct = path
#  gene gtf annotation table
parameter: annotation_gtf = path
# A file to map sample ID from expression to genotype,must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path
parameter: tpm_threshold = 0.1
parameter: count_threshold = 6
parameter: sample_frac_threshold = 0.2
# Normalization method: TMM + CPM(voom) (tmm_cpm_voom), TMM + CPM(edgeR) (tmm_cpm_edger), or quantile normalization (qn)
parameter: normalization_method = 'tmm_cpm_voom'
# Quantile Normalization after rescale
parameter: quantile_normalize = True
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 20
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
[normalize]
# Path to the input molecular phenotype data, should be a processd and indexed bed.gz file, with tabix index.
input: tpm_gct, counts_gct, annotation_gtf, sample_participant_lookup
output: f'{cwd:a}/{_input[0]:bnnn}.{normalization_method}.{"no_qnorm." if not quantile_normalize else ""}expression.bed.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand= "${ }", stderr = f'{_output[0]:nn}.stderr', stdout = f'{_output[0]:nn}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_normalization.py \
        --step normalize \
        --cwd "${cwd}" \
        --tpm-gct "${_input[0]}" \
        --counts-gct "${_input[1]}" \
        --annotation-gtf "${_input[2]}" \
        --sample-participant-lookup "${_input[3]}" \
        --tpm-threshold ${tpm_threshold} \
        --count-threshold ${count_threshold} \
        --sample-frac-threshold ${sample_frac_threshold} \
        --normalization-method "${normalization_method}" \
        ${"--quantile-normalize" if quantile_normalize else ""} \
        --numThreads ${numThreads}