Sample-level RNA-seq quality control#

Description#

Sample-level QC of a bulk RNA-seq expression matrix, following the GTEx V8 approach. It runs in two parts: qc_1 removes low-expression genes (genes expressed below the TPM threshold in too many samples), and qc_2 detects and removes outlier samples using three checks (a D-statistic histogram, a Relative Log Expression (RLE) plot, and hierarchical clustering). A raw count matrix, if provided, is filtered to the same genes and samples as the QC’d TPM matrix.

Timing: ~2-5 min on typical compute infrastructure.

Input Files#

Parameter

Example (toy)

Description

--tpm-gct

input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed

Gene expression in TPM; gene ID in the first column, one column per sample. Required.

--counts-gct

input/rnaseq/protocol_example.rnaseq.count_matrix.bed

Raw gene count matrix. Optional; if given it is filtered to match the QC’d TPM genes/samples.

The toy inputs are symlinks under input/ pointing into the bundled toy dataset.

Output Files#

Written to --cwd (output/rnaseq/). The output basename is taken from the TPM input name truncated at the first dot, so here it is protocol_example:

File

Description

protocol_example.low_expression_filtered.tpm.gct.gz

TPM matrix after low-expression gene filtering (qc_1).

protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz

TPM matrix after outlier-sample removal (qc_2).

protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz

Raw count matrix filtered to the same genes/samples.

protocol_example.low_expression_filtered.outlier_removed.tpm.gct.D_stat_hist.pdf

D-statistic histogram diagnostic plot.

protocol_example.low_expression_filtered.outlier_removed.tpm.gct.RLEplot.pdf

Relative Log Expression diagnostic plot.

protocol_example.low_expression_filtered.outlier_removed.tpm.gct.preQC_cluster.pdf

Sample clustering diagnostic plot.

The outlier_removed.tpm.gct.gz file is the input to multi-sample normalization.

Steps#

Note: DSFilterPercent defaults to 0.05; for the small toy cohort it can be raised (e.g. 0.1) so a single sample is not over-flagged. --container '' makes the workflow use the locally installed tools.

sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed \
    --counts-gct input/rnaseq/protocol_example.rnaseq.count_matrix.bed 

Command interface#

sos run pipeline/bulk_expression_QC.ipynb -h

Workflow implementation#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Required input is TPM file
parameter: tpm_gct = path
# Raw counts file is optional and if available, it will be filtered to match with the TPM file sample and genes
parameter: counts_gct = path()
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"

# Number of threads
parameter: numThreads = 8
[qc_1 (basic check and low expression filtering)]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
input: tpm_gct
output: f'{cwd}/{_input:bnnn}.low_expression_filtered.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_1 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --low-expr-TPM ${low_expr_TPM} \
        --low-expr-TPM-percent ${low_expr_TPM_percent} \
        --numThreads ${numThreads}

CAUTION: I (GW) am not sure what to do with the offset for log-transformation on TPM. GTEx suggests using offset = 1. Here I keep the recommendation from these authors where both 0.0001 and 1 were used in different steps.

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[qc_2 (remove outliers)]
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalue_cutoff = 0.05
parameter: cluster_level = 5
output: f'{cwd}/{_input:bnnn}.outlier_removed.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_2 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --RLEFilterPercent ${RLEFilterPercent} \
        --DSFilterPercent ${DSFilterPercent} \
        --topk-genes ${topk_genes} \
        --cluster-percent ${cluster_percent} \
        --pvalue-cutoff ${pvalue_cutoff} \
        --cluster-level ${cluster_level} \
        --numThreads ${numThreads}

Step (added from upstream): qc_3#

[qc_3 (remove gene and samples from raw counts)]
stop_if(not counts_gct.is_file())
output: f'{cwd}/{_input:bnnn}.geneCount.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_3 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --counts-gct "${counts_gct}" \
        --numThreads ${numThreads}