Sample level RNA-seq quality control

Sample level RNA-seq quality control#

Description#

Quality control of the TPM matrices follows methods outlined by GTEx V8. First, genes are removed from the matrices if over 20% of samples have have a TPM expression level of 10% or less. Sample level filtering includes three checks to detect sample outliers. Samples are only removed if they are marked as outliers in all three checks.

The first looks at Relative Log Expression (RLE). It is assumed that most gene expression values in a sample should be near a mean value and only a few genes are differentially expressed. To calculate RLE, for each gene i, calculate its median in N samples as Medi. Then for each sample j and expression value eij, count the difference between eij and Medi (deij = eij-Medi). Then create boxplots for each sample based on deij and sort by interquartile range. Samples with larger interquartile ranges are more likely to be outliers.

The second quality control check looks at heirarchical clustering of samples. Samples are expected to have short distances between others and therefore should cluster homogeneously, so distant samples are expected to be outliers. Distance is calculated as 1-spearman correlation for heirarchical clustering. The top 100 genes sorted by variance are used to calculate Mahalanobis distance. Chi2 p-values are then calculated based on Mahalanobis distance. Clusters with 60% or more samples with Bonferroni corrected p-values less than 0.05 are marked as outliers.

The third and last quality control check looks at D-statistics which represent the average correlation between a sample’s expression and other sampels. Samples with low D-statistics are likely to be outliers.

Input#

Gene expression in TPM. A data table with gene ID as first column and each sample ID as a subsquent columns.

Output#

A set of three diagnostic plots from each of the outlier detection method
A TPM file with low expression genes, samples with missing data, and outliers removed
Raw gene count file will also be processed to remove the same individuals and genes, as determined by the QC on TPM data.

Minimal Working Example Steps#

vi. Multi-sample RNA-seq QC#

Timing: <15min

Implement RNA-seq QC that identifies and removes genes and samples from the TPM matrix

Note: We recommend using DSFilterPercent default value of 0.05. DSFilterPercent changed from default (0.05) to 0.1 here for testing with few samples.

sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct data/rnaseq/bulk_rnaseq_tmp_matrix.bed \
    --counts-gct data/rnaseq/bulk_rnaseq_count_matrix.bed

INFO: Running basic check and low expression filtering: 
INFO: basic check and low expression filtering is completed.
INFO: basic check and low expression filtering output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.tpm.gct.gz
INFO: Running remove outliers: 
INFO: remove outliers is completed.
INFO: remove outliers output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tpm.gct.gz
INFO: Running remove gene and samples from raw counts: 
INFO: remove gene and samples from raw counts is completed.
INFO: remove gene and samples from raw counts output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.geneCount.gct.gz
INFO: Workflow qc (ID=w533b8add03a0af8a) is executed successfully with 3 completed steps.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command interface#

sos run bulk_expression_QC.ipynb -h

usage: sos run bulk_expression_QC.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  qc

Global Workflow Options:
  --tpm-gct VAL (as path, required)
                        Required input is TPM file
  --counts-gct . (as path)
                        Raw counts file is optional and if available, it will be
                        filtered to match with the TPM file sample and genes
  --cwd output (as path)
  --container ''

Sections
  qc_1:
    Workflow Options:
      --low-expr-TPM 0.1 (as float)
      --low-expr-TPM-percent 0.2 (as float)
  qc_2:
    Workflow Options:
      --RLEFilterPercent 0.05 (as float)
      --DSFilterPercent 0.05 (as float)
      --topk-genes 100 (as int)
      --cluster-percent 0.6 (as float)
      --pvalue-cutoff 0.05 (as float)
      --cluster-level 5 (as int)
  qc_3:

Workflow implementation#

[global]
parameter: renovated_code_dir = path('renovated_code/script')  # override with --renovated-code-dir
# Required input is TPM file
parameter: tpm_gct = path
# Raw counts file is optional and if available, it will be filtered to match with the TPM file sample and genes
parameter: counts_gct = path()
parameter: cwd = path("output")
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
cwd = path(f'{cwd:a}')
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"

# Number of threads
parameter: numThreads = 8

[qc_1 (basic check and low expression filtering)]
parameter: low_expr_TPM = 0.1
parameter: low_expr_TPM_percent = 0.2
input: tpm_gct
output: f'{cwd}/{_input:bnnn}.low_expression_filtered.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    Rscript ${renovated_code_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_1 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --low-expr-TPM ${low_expr_TPM} \
        --low-expr-TPM-percent ${low_expr_TPM_percent} \
        --numThreads ${numThreads}

CAUTION: I (GW) am not sure what to do with the offset for log-transformation on TPM. GTEx suggests using offset = 1. Here I keep the recommendation from these authors where both 0.0001 and 1 were used in different steps.

[qc_2 (remove outliers)]
parameter: RLEFilterPercent = 0.05
parameter: DSFilterPercent = 0.05
parameter: topk_genes = 100
parameter: cluster_percent = 0.6
parameter: pvalue_cutoff = 0.05
parameter: cluster_level = 5
output: f'{cwd}/{_input:bnnn}.outlier_removed.tpm.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nnn}.stderr', stdout = f'{_output:nnn}.log', container = container, entrypoint = entrypoint
    Rscript ${renovated_code_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_2 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --RLEFilterPercent ${RLEFilterPercent} \
        --DSFilterPercent ${DSFilterPercent} \
        --topk-genes ${topk_genes} \
        --cluster-percent ${cluster_percent} \
        --pvalue-cutoff ${pvalue_cutoff} \
        --cluster-level ${cluster_level} \
        --numThreads ${numThreads}

Additional formatting:

Filter out the geneCount table based on TPM table.
Adds two comment lines above the header of TPM and geneCount table to mimick the original output from RNASeQC.

[qc_3 (remove gene and samples from raw counts)]
stop_if(not counts_gct.is_file())
output: f'{cwd}/{_input:bnnn}.geneCount.gct.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:nn}.stderr', stdout = f'{_output:nn}.stdout', container = container, entrypoint = entrypoint
    Rscript ${renovated_code_dir}/molecular_phenotypes/QC/bulk_expression_QC.R \
        --step qc_3 \
        --cwd "${cwd}" \
        --tpm-gct "${_input}" \
        --counts-gct "${counts_gct}" \
        --numThreads ${numThreads}