Sample-level RNA-seq quality control#
Description#
Sample-level QC of a bulk RNA-seq expression matrix, following the GTEx V8 approach. It runs in two parts: qc_1 removes low-expression genes (genes expressed below the TPM threshold in too many samples), and qc_2 detects and removes outlier samples using three checks (a D-statistic histogram, a Relative Log Expression (RLE) plot, and hierarchical clustering). A raw count matrix, if provided, is filtered to the same genes and samples as the QC’d TPM matrix.
Timing: ~2-5 min on typical compute infrastructure.
Input Files#
Parameter |
Example (toy) |
Description |
|---|---|---|
|
|
Gene expression in TPM; gene ID in the first column, one column per sample. Required. |
|
|
Raw gene count matrix. Optional; if given it is filtered to match the QC’d TPM genes/samples. |
The toy inputs are symlinks under input/ pointing into the bundled toy dataset.
Output Files#
Written to --cwd (output/rnaseq/). The output basename is taken from the TPM input name truncated at the first dot, so here it is protocol_example:
File |
Description |
|---|---|
|
TPM matrix after low-expression gene filtering ( |
|
TPM matrix after outlier-sample removal ( |
|
Raw count matrix filtered to the same genes/samples. |
|
D-statistic histogram diagnostic plot. |
|
Relative Log Expression diagnostic plot. |
|
Sample clustering diagnostic plot. |
The outlier_removed.tpm.gct.gz file is the input to multi-sample normalization.
Steps#
Note: DSFilterPercent defaults to 0.05; for the small toy cohort it can be raised (e.g. 0.1) so a single sample is not over-flagged. --container '' makes the workflow use the locally installed tools.
sos run pipeline/bulk_expression_QC.ipynb qc \
--cwd output/rnaseq \
--tpm-gct input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed \
--counts-gct input/rnaseq/protocol_example.rnaseq.count_matrix.bed
Command interface#
sos run pipeline/bulk_expression_QC.ipynb -h
Workflow implementation#
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# Required input is TPM file
parameter: tpm_gct = path
# Raw counts file is optional and if available, it will be filtered to match with the TPM file sample and genes
parameter: counts_gct = path()
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
[qc_1 (basic check and low expression filtering)]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
input: tpm_gct
output: f'{cwd}/{_input:bnnn}.low_expression_filtered.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
--step qc_1 \
--cwd "${cwd}" \
--tpm-gct "${_input}" \
--low-expr-TPM ${low_expr_TPM} \
--low-expr-TPM-percent ${low_expr_TPM_percent} \
--numThreads ${numThreads}
CAUTION: I (GW) am not sure what to do with the offset for log-transformation on TPM. GTEx suggests using offset = 1. Here I keep the recommendation from these authors where both 0.0001 and 1 were used in different steps.
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
[qc_2 (remove outliers)]
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalue_cutoff = 0.05
parameter: cluster_level = 5
output: f'{cwd}/{_input:bnnn}.outlier_removed.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
--step qc_2 \
--cwd "${cwd}" \
--tpm-gct "${_input}" \
--RLEFilterPercent ${RLEFilterPercent} \
--DSFilterPercent ${DSFilterPercent} \
--topk-genes ${topk_genes} \
--cluster-percent ${cluster_percent} \
--pvalue-cutoff ${pvalue_cutoff} \
--cluster-level ${cluster_level} \
--numThreads ${numThreads}
Step (added from upstream): qc_3#
[qc_3 (remove gene and samples from raw counts)]
stop_if(not counts_gct.is_file())
output: f'{cwd}/{_input:bnnn}.geneCount.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
--step qc_3 \
--cwd "${cwd}" \
--tpm-gct "${_input}" \
--counts-gct "${counts_gct}" \
--numThreads ${numThreads}