Quantifying expression from RNA-seq data#

Description#

Our pipeline follows the GTEx pipeline from the GTEx/TOPMed project for the quantification of expression from RNA-seq data. Either paired-end or single-end fastq.gz files may be used as the initial input. Different read strandedness options are possible, including rf, fr, unstranded or strand_missing, based on library types from Signal et al [cf. Signal et al (2022)]. Strand detection steps are included in this pipeline to automatically detect the strand of the input fastq files by leveraging the gene count table output from STAR. Read length information is also used: samples whose read length is unknown (entered as zero) fall back to a default value of 100.

We recommend using fastqc to quality control reads, followed by the removal of adaptors with fastp if necessary. After quality control has been conducted, STAR may be used to align reads to the reference genome. We combine this mapping step with an additional quality control step using Picard to mark duplicate reads and collect other metrics.

Gene-level RNA expression is called with RNA-SeQC v2.4.2 using a reference gtf that has been collapsed to contain genes instead of gene transcripts [cf. DeLuca et al., Bioinformatics, 2012]. Reads are only included if they are uniquely mapped (mapping quality of 255 for STAR BAMs), if they are aligned in proper pairs, if the read alignment distance is less than or equal to six (i.e. alignments must not contain more than six non-reference bases), and if they are fully contained within exon boundaries. Reads overlapping introns are not included. Exon-level read counts are also called by RNA-SeQC. If a read overlaps multiple exons, a fractional value equal to the portion of the read contained within each exon is allotted. We call transcript-level RNA expression using RSEM v1.3.0 with a transcript reference gtf.
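For intuition, these filters can be sketched in a few lines of Python with pysam (a simplified illustration only, not the actual RNA-SeQC implementation; the BAM path is a placeholder):

import pysam

# Simplified sketch of the RNA-SeQC read-inclusion criteria described above;
# the real tool additionally requires reads to be fully contained within exons.
bam = pysam.AlignmentFile("sample.Aligned.sortedByCoord.out.bam", "rb")
kept = 0
for read in bam:
    if read.mapping_quality != 255:                       # uniquely mapped (STAR convention)
        continue
    if not read.is_proper_pair:                           # aligned in a proper pair
        continue
    if not read.has_tag("NM") or read.get_tag("NM") > 6:  # at most six non-reference bases
        continue
    kept += 1                                             # exon-containment check omitted
print(f"{kept} reads pass the MAPQ/pair/NM filters")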

Input#

A tab-delimited file containing sample read information with one line per sample. This file should include a header for the following columns:

  1. ID - sample ID (required)

  2. fq1 - name of the fastq file (required)

  3. fq2 - name of the second fastq file of the sample if using paired end data (required only if using paired end data)

  4. strand - the strandedness of the sample (optional)
    Options include rf, fr, unstranded or strand_missing, based on library types from Signal et al [cf. Signal et al. 2022]. Strand detection steps are included in this pipeline to automatically detect the strand of the input fastq file by leveraging the gene count table output from STAR.

  5. read_length - the read length of the sample’s reads (optional)
    If this is unknown, enter 0 and the pipeline will use a default read length of 100. This default can be changed using the --sjdbOverhang parameter. If none of the samples have read length information, leave out the read_length column entirely so that the pipeline uses the default length for each sample.
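For illustration, a hypothetical paired-end sample list with both optional columns might look like this (tab-delimited; sample and file names are placeholders):

ID       fq1                 fq2                 strand          read_length
sample1  sample1.1.fastq.gz  sample1.2.fastq.gz  rf              100
sample2  sample2.1.fastq.gz  sample2.2.fastq.gz  strand_missing  0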

Output#

Gene and transcript expression matrices in addition to other intermediate files (see Command interface steps below)

Minimal Working Example Steps#

i. Perform data quality summary via fastqc#

Timing: <4 min

Generate fastqc report.


sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --sample-list data/fastq.list.txt \
    --data-dir data/fastq
INFO: Input samples are paired-end sequences.
INFO: Running fastqc: 
INFO: fastqc (index=0) is completed.
INFO: fastqc (index=2) is completed.
INFO: fastqc (index=3) is completed.
INFO: fastqc (index=1) is completed.
INFO: fastqc output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/fastqc/test_sample1.1_fastqc.html /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/fastqc/test_sample1.1_fastqc.zip... (8 items in 4 groups)
INFO: Workflow fastqc (ID=w520ec78ac06c1533) is executed successfully with 1 completed step and 4 completed substeps.

ii. Cut adaptor (Optional)#

Timing: ~10 min

Trim adaptors with fastp. A new sample list will be generated with the trimmed fastq files.

sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --sample-list data/fastq.list.txt \
    --data-dir data/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat
INFO: Input samples are paired-end sequences.
INFO: Running fastp_trim_adaptor_1: 
INFO: fastp_trim_adaptor_1 (index=1) is completed.
INFO: fastp_trim_adaptor_1 (index=0) is completed.
INFO: fastp_trim_adaptor_1 output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/test_sample1.1.trimmed.fq.gz /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/test_sample1.2.trimmed.fq.gz... (4 items in 2 groups)
INFO: Running fastp_trim_adaptor_2: 
INFO: fastp_trim_adaptor_2 is completed.
INFO: fastp_trim_adaptor_2 output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/data/fastq.list.trimmed.txt
INFO: Workflow fastp_trim_adaptor (ID=w4c40fbcf48b0752a) is executed successfully with 2 completed steps and 3 completed substeps.

iii. Read alignment via STAR and QC via Picard#

Timing <2 hours

Align the reads with STAR and generate the bam_list recipe for downstream molecular phenotype count matrices. This step also checks the strandedness of the data and marks duplicates with Picard. If --varVCFfile is provided, vW tags will be added to the alignments in [STAR_align_1], and filtering based on these tags will be performed in [filter_reads]. The --wasp option applies WASP correction, while --unique enables unique-read filtering. Using both options applies both filters, while omitting them means no filtering is applied. Note that filtering is not required when quantifying expression for eQTLs.

# STAR alignment without adding tags for filtering (and therefore without filtering); this also runs faster than STAR alignment with filters
sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam --sample-list data/fastq.list.txt \
    --data-dir data/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --chimSegmentMin 0 \
    -J 50 --mem 200G --numThreads 8    
# STAR alignment with only WASP correction to diminish allele-specific bias:
sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam_wasp --sample-list data/fastq.list.txt \
    --data-dir data/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --wasp yes \
    --chimSegmentMin 0 \
    -J 50 --mem 200G --numThreads 8
# STAR alignment with only unique filtering to keep uniquely mapped reads:
sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam_unique --sample-list data/fastq.list.txt \
    --data-dir data/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --unique yes \
    --chimSegmentMin 0 \
    -J 50 --mem 200G --numThreads 8
# STAR alignment with both WASP correction and unique filtering:
sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam_wasp_unique --sample-list data/fastq.list.txt \
    --data-dir data/fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --wasp yes \
    --unique yes \
    --chimSegmentMin 0 \
    -J 50 --mem 200G --numThreads 8

iv. Call gene-level RNA expression via rnaseqc#

Timing <30 min

Call gene-level RNA expression using rnaseqc and run multiqc. The gtf file must include gene-level data instead of transcript-level data. Add --varVCFfile if tags were added during STAR_align. Change the --cwd option to match the filter choice made during alignment.

sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd data/bam \
    --sample-list data/fastq.list.txt \
    --data-dir data/fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --varVCFfile reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --bam_list data/bam/sample_bam_list.txt

v. Call transcript level RNA expression via RSEM#

Timing <X hours

Call transcript-level RNA expression using RSEM and run multiqc. The gtf file used as input should match the one used to generate the RSEM index and therefore contain transcript-level, not gene-level, information. Add --varVCFfile if tags were added during STAR_align. Change the --cwd option to match the filter choice made during alignment.

sos run pipeline/RNA_calling.ipynb rsem_call \
    --cwd data/bam \
    --sample-list data/fastq.list.txt \
    --data-dir data/fastq \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --varVCFfile reference_data/ZOD14598_AD_GRM_WGS_2021-04-29_all.recalibrated_variants.leftnorm.filtered.AF.WASP.vcf \
    --bam_list data/bam/sample_bam_list.txt \
    --RSEM-index reference_data/RSEM_Index

Troubleshooting#

Step | Substep | Problem | Possible Reason | Solution

Command interface#

!sos run RNA_calling.ipynb -h
usage: sos run RNA_calling.ipynb [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  bam_to_fastq
  fastqc
  fastp_trim_adaptor
  trimmomatic_trim_adaptor
  STAR_align
  strand_detected
  picard_qc
  rnaseqc_call
  rsem_call
  filter_reads

Global Workflow Options:
  --cwd output (as path)
                        The output directory for generated files.
  --sample-list VAL (as path, required)
                        Sample meta data list
  --sample-name  (as list)
                        Sample names to analyze
  --data-dir  path(f"{sample_list:d}")

                        Raw data directory, default to the same directory as
                        sample list
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --java-mem 6G
                        Memory for Java virtual machine (`picard`)
  --numThreads 8 (as int)
                        Number of threads
  --container ''
                        Software container option
  --varVCFfile ''
                        VarVCFfile for data preparation for wasp_filtering
  --entrypoint  ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""

  --[no-]uncompressed (default to False)
                        Whether the fasta/fastq file is compressed or not.

Sections
  bam_to_fastq:
  fastqc:
  fastp_trim_adaptor_1:
    Workflow Options:
      --window-size 4 (as int)
                        sliding window setting
      --required-quality 20 (as int)
      --leading 20 (as int)
                        the mean quality requirement option for cut_front
      --trailing 20 (as int)
                        the mean quality requirement option for cut_tail
      --min-len 15 (as int)
                        reads shorter than length_required will be discarded
      --fasta-with-adapters-etc . (as path)
                        Path to the reference adaptors
  fastp_trim_adaptor_2:
  trimmomatic_trim_adaptor:
    Workflow Options:
      --fasta-with-adapters-etc . (as path)
                        Illumina clip setting Path to the reference adaptors
      --seed-mismatches 2 (as int)
      --palindrome-clip-threshold 30 (as int)
      --simple-clip-threshold 10 (as int)
      --window-size 4 (as int)
                        sliding window setting
      --required-quality 20 (as int)
      --leading 3 (as int)
                        Other settings
      --trailing 3 (as int)
      --min-len 50 (as int)
  STAR_align_1:
    Workflow Options:
      --gtf VAL (as path, required)
                        Reference gene model
      --STAR-index VAL (as path, required)
                        STAR indexing file
      --outFilterMultimapNmax 20 (as int)
      --alignSJoverhangMin 8 (as int)
      --alignSJDBoverhangMin 1 (as int)
      --outFilterMismatchNmax 999 (as int)
      --outFilterMismatchNoverLmax 0.1 (as float)
      --alignIntronMin 20 (as int)
      --alignIntronMax 1000000 (as int)
      --alignMatesGapMax 1000000 (as int)
      --outFilterType BySJout
      --outFilterScoreMinOverLread 0.33 (as float)
      --outFilterMatchNminOverLread 0.33 (as float)
      --limitSjdbInsertNsj 1200000 (as int)
      --outSAMstrandField intronMotif
      --outFilterIntronMotifs None
      --alignSoftClipAtReferenceEnds Yes
      --quantMode TranscriptomeSAM GeneCounts (as list)
      --outSAMattrRGline ID:rg1 SM:sm1 (as list)
      --outSAMattributes NH HI AS nM NM ch (as list)
      --chimSegmentMin 15 (as int)
      --chimJunctionOverhangMin 15 (as int)
      --chimOutType Junctions WithinBAM SoftClip (as list)
      --chimMainSegmentMultNmax 1 (as int)
      --sjdbOverhang 100 (as int)
      --mapping-quality 255 (as int)
  strand_detected_1:
  strand_detected_2:
    Workflow Options:
      --strand ''
  picard_qc, STAR_align_2:
    Workflow Options:
      --gtf VAL (as path, required)
                        Reference gene model
      --ref-flat VAL (as path, required)
                        Path to flat reference file, for computing QC metric
      --reference-fasta VAL (as path, required)
                        The fasta reference file used to generate star index
      --optical-distance 100 (as int)
                        For the patterned flowcell models (HiSeq X), change to
                        2500
      --[no-]zap-raw-bam (default to False)
  STAR_align_3:
  rnaseqc_call_1:
    Workflow Options:
      --bam-list VAL (as path, required)
      --gtf VAL (as path, required)
                        Reference gene model
      --detection-threshold 5 (as int)
      --mapping-quality 255 (as int)
  rnaseqc_call_2:
    Workflow Options:
      --bam-list VAL (as path, required)
  rsem_call_1:
    Workflow Options:
      --bam-list VAL (as path, required)
      --RSEM-index VAL (as path, required)
      --max-frag-len 1000 (as int)
      --[no-]estimate-rspd (default to True)
  rsem_call_2:
    Workflow Options:
      --bam-list VAL (as path, required)
  rsem_call_3, rnaseqc_call_3:
  rsem_call_4, rnaseqc_call_4:
  filter_reads:
    Workflow Options:
      --unique ''
      --wasp ''
      --mapping-quality 255 (as int)

Setup and global parameters#

[global]
# The output directory for generated files.
parameter: cwd = path("output")
# Sample meta data list
parameter: sample_list = path
# Sample names to analyze
parameter: sample_name = list() #input should be --sample_name sample1 sample2, if multiple samples
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{sample_list:d}")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Memory for Java virtual machine (`picard`)
parameter: java_mem = "6G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
# VarVCFfile for data preparation for wasp_filtering
parameter: varVCFfile = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
## Whether the fasta/fastq file is compressed or not.
parameter: uncompressed = False
import os
import pandas as pd
## FIX: The way to get samples needs to be revamped to 1. accommodate rf/fr as a column 2. accommodate single-end reads (Only 2 fq/samples)
sample_inv = pd.read_csv(sample_list,sep = "\t")

# Align specific samples; to ensure correct grouping in subsequent steps, perform this step prior to extracting strand, read_length, and sample_id information
if len(sample_name)>0:
    print("Align sample",sample_name, 'only...')
    sample_inv = sample_inv[sample_inv['ID'].isin(sample_name)] 
    
## Extract strand information if the user has specified the strand
strand_inv = []

if "strand" in sample_inv.columns:
    strand_inv = sample_inv.strand.values.tolist()
    sample_inv = sample_inv.drop("strand" , axis = 1)
    stop_if(not all([x in ["fr", "rf", "unstranded","strand_missing"] for x in strand_inv ]), msg = "strand columns should only include ``fr``, ``rf``, ``strand_missing`` or ``unstranded``")
            
## Extract read_length if the user has specified read_length
read_length = [0] * len(sample_inv.index)
if "read_length" in sample_inv.columns:
    read_length = sample_inv.read_length.values.tolist()
    sample_inv = sample_inv.drop("read_length" , axis = 1)
    

## Extract sample_id
sample_inv_list = sample_inv.values.tolist()
sample_id = [x[0] for x in sample_inv_list]

    
## Get the file name for single/paired end data
file_inv = [x[1:] for x in sample_inv_list]
file_inv = [item for sublist in file_inv for item in sublist]

raw_reads = [f'{data_dir}/{x}' for x in file_inv]


for y in raw_reads:
        if not os.path.isfile(y):
            raise ValueError(f"File {y} does not exist")

if len(raw_reads) != len(set(raw_reads)):
        raise ValueError("Duplicated files are found (but should not be allowed) in sample file list")

# Is the RNA-seq data paired-end
is_paired_end = 0 if len(raw_reads) == len(sample_id) else 1 
from sos.utils import env
env.logger.info(f'Input samples are {"paired-end" if is_paired_end else "single-end"} sequences.')

Step 0: Convert BAM back to fastq if necessary#

[bam_to_fastq]
input: raw_reads, group_by = 1
output: f'{cwd}/{_input:bn}.1.fastq',f'{cwd}/{_input:bn}.2.fastq'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    samtools sort -n ${_input} -o ${_output[0]:nn}.sorted.bam
    bedtools bamtofastq -i ${_output[0]:nn}.sorted.bam -fq ${_output[0]} -fq2 ${_output[1]}

Step 1: QC before alignment#

This step utilizes fastqc and will generate QC reports in HTML format. It should be noted that paired-end reads are also QCed separately.

Step Inputs#

  • fastq1 and fastq2: paths to the original fastq.gz files.

Step Outputs#

  • An HTML QC report (and a zip archive) for each fastq file

[fastqc]
input: raw_reads, group_by =  1
output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container, entrypoint=entrypoint
    fastqc ${_input} -o ${_output[0]:d}

Step 2: Remove adaptor through fastp#

Documentation: fastp

We use fastp in place of Trimmomatic because of fastp's ability to detect adapters directly from the reads. fastp is a C++ command line tool published in September 2018. It uses the following algorithm to detect the adaptors:

The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k = 10) of first N reads (N = 1 M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter

It has been demonstrated that fastp can remove adaptors automatically and completely, and that it does so faster than Trimmomatic and cutadapt.
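As a rough illustration of the seeding idea, here is a toy Python sketch under simplified assumptions (not fastp's actual implementation):

from collections import Counter

# Toy sketch of fastp-style adapter-seed detection: count k-mers from read tails
# and keep high-frequency, non-low-complexity ones as candidate adapter seeds.
def adapter_seeds(reads, k=10, min_freq=0.0001):
    counts = Counter()
    for read in reads:
        tail = read[-2 * k:]                  # assumption: adapters sit in the read tails
        for i in range(len(tail) - k + 1):
            counts[tail[i:i + k]] += 1
    total = sum(counts.values())
    return sorted((kmer for kmer, n in counts.items()
                   if n / total > min_freq and len(set(kmer)) > 2),  # drop low-complexity seeds
                  key=counts.get, reverse=True)

# fastp then extends the top seeds with a tree-based algorithm to recover the full
# adapter sequence; that extension step is omitted here.
print(adapter_seeds(["ACGTACGTAGATCGGAAGAGCACAC"] * 100))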

Step Inputs#

  • fastq: 1 set of fq.gz files per sample (2 files if paired-end, 1 file if single-end)

Step Outputs#

  • 1 set of fastq.gz files for alignment (2 files if paired-end, 1 file if single-end)

  • The unpaired reads (i.e. where a read survived but the partner read did not) are discarded by default, as they are not used in the following steps. This feature can be added where needed.

  • 1 report set documenting the quality of the input reads (1 HTML file and 1 JSON file)

Step options#

A few options were selected to be customizable so that the fastp step has the same level of flexibility as the Trimmomatic step had. They are:

  • min_len: length_required, reads shorter than this will be discarded, default is 15. (int [=15])

  • window_size: cut_window_size, the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])

  • leading/trailing : cut_front_mean_quality/cut_tail_mean_quality, the mean quality requirement options for cut_front/cut_tail, which move a sliding window from the front/tail and drop the bases in the window if its mean quality < threshold. Notice that the choice of quality score in fastp (N=20) is a lot higher than that of Trimmomatic (N=3). The default fastp setting is in line with that of cutadapt.

The default values are set to be the same as the defaults of the fastp software.
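To make the cut_front/cut_tail behavior concrete, here is a minimal Python sketch of front trimming with a sliding quality window (simplified relative to fastp's actual implementation; the quality values are hypothetical Phred scores):

def cut_front(qualities, window_size=4, mean_quality=20):
    # Slide a window from the 5' end; bases are dropped until the window mean passes the threshold.
    for start in range(len(qualities) - window_size + 1):
        if sum(qualities[start:start + window_size]) / window_size >= mean_quality:
            return start           # keep the read from this position onward
    return len(qualities)          # no window passes; the whole read is discarded

print(cut_front([2, 5, 8, 30, 35, 36, 37, 38]))  # -> 2: the low-quality 5' bases are trimmed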

[fastp_trim_adaptor_1]
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# the mean quality requirement option for cut_front
parameter: leading = 20
# the mean quality requirement option for cut_tail
parameter: trailing = 20
# reads shorter than length_required will be discarded
parameter: min_len = 15
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
warn_if(fasta_with_adapters_etc.is_file(), msg = "Using the input adapter fasta; automatic adaptor detection for paired-end reads is disabled")

input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: [f'{cwd}/{path(x):bn}.trimmed.fq.gz' for x in _input]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash:  container=container,expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
        fastp -i ${f'{_input[0]} -I {_input[1]}' if is_paired_end else _input} -o ${ f'{_output[0]} -O {_output[1]}' if is_paired_end else _output } \
            ${f'--adapter_fasta {fasta_with_adapters_etc}' if fasta_with_adapters_etc.is_file() else "--detect_adapter_for_pe"}  -V -h ${_output[0]:n}.html -j ${_output[0]:n}.json -w ${numThreads} \
            --length_required ${min_len}  -W ${window_size} -M ${required_quality} -5 -3 --cut_front_mean_quality ${leading} --cut_tail_mean_quality ${trailing}
[fastp_trim_adaptor_2]
input: group_by = "all"
output: f'{cwd}/{sample_list:n}.trimmed.txt'
sample_inv_tmp = pd.read_csv(sample_list,sep = "\t")
import csv
if is_paired_end:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input][::2]
    sample_inv_tmp.fq2 = [f'{x:r}'.replace("'","") for x in _input][1::2]
else:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input]
sample_inv_tmp.to_csv(_output,sep = "\t",index = 0)

Step 2 Alternative: Remove adaptor through Trimmomatic#

Documentation: Trimmomatic

We have replaced this with fastp (see above), which performs better than Trimmomatic in terms of removing adaptors that Trimmomatic cannot detect. fastp can also automatically infer the adapter sequences from the data, and by default no adapter sequence is required as input to fastp.

Step Inputs#

  • software_dir: directory for the software

  • fasta_with_adapters_etc: filename for the adapter reference file. According to the Trimmomatic documentation,

As a rule of thumb newer libraries will use TruSeq3, but this really depends on your service provider. If you use FASTQC, the “Overrepresented Sequences” report can help indicate which adapter file is best suited for your data. “Illumina Single End” or “Illumina Paired End” sequences indicate single-end or paired-end TruSeq2 libraries, and the appropriate adapter files are TruSeq2-SE.fa and TruSeq2-PE.fa respectively. “TruSeq Universal Adapter” or “TruSeq Adapter, Index …” indicates TruSeq-3 libraries, and the appropriate adapter files are TruSeq3-SE.fa or TruSeq3-PE.fa, for single-end and paired-end data respectively. Adapter sequences for TruSeq2 multiplexed libraries, indicated by “Illumina Multiplexing …”, and the various RNA library preparations are not currently included.

We have the fastqc workflow previously defined and executed. Users should decide what fasta adapter reference to use based on fastqc results (or their own knowledge).

Step Outputs#

  • Two paired fastq.gz file for alignment

  • Two unpaired fastq.gz files

For single-end data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files: 2 for the 'paired' output where both reads survived the processing, and 2 for the corresponding 'unpaired' output where a read survived but the partner read did not.

You need to figure out from fastqc results which adapter reference sequence to use, e.g. --fasta_with_adapters_etc TruSeq3-PE.fa. These files can be downloaded from the Trimmomatic github repo.

[trimmomatic_trim_adaptor]
# Illumina clip setting
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
parameter: seed_mismatches = 2
parameter: palindrome_clip_threshold = 30
parameter: simple_clip_threshold = 10
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# Other settings
parameter: leading = 3
parameter: trailing = 3
parameter: min_len = 50
input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: ([ f'{cwd}/{_sample_id}_paired_{_input[0]:bn}.gz', f'{cwd}/{_sample_id}_unpaired_{_input[0]:bn}.gz',  f'{cwd}/{_sample_id}_paired_{_input[1]:bn}.gz',f'{cwd}/{_sample_id}_unpaired_{_input[1]:bn}.gz' ] if is_paired_end else f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz')
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    trimmomatic -Xmx${java_mem} ${"PE" if is_paired_end else "SE"}  -threads ${numThreads} \
        ${_input:r}  ${_output:r} \
        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \
        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}

Step 3: Alignment through STAR#

Documentation : STAR and Script in docker

This step is the main step for STAR alignment.

Originally, the STAR alignment was implemented using the run_STAR.py command from GTEx. However, in order to accommodate different read lengths among samples, we reimplement what was included in the wrapper, including the sorting of the unsorted STAR BAM using samtools to avoid high memory consumption.

Step Inputs#

  • paths to the trimmed fastq.gz files from Step 2, as documented in the new sample list.

  • STAR_index: directory for the STAR alignment index

Step Outputs#

  • bam file output ${cwd}/{sample_id}.Aligned.sortedByCoord.bam, which will be used in Steps 4 and 5

  • bam file output ${cwd}/{sample_id}.Aligned.toTranscriptome.bam, which will be used in Step 6

NB: "${ }" and "$[ ]" are interchangeable expansion delimiters; we just keep them consistent throughout each bash block.

[STAR_align_1]
# Reference gene model
parameter: gtf = path
# STAR indexing file
parameter: STAR_index = path
parameter: outFilterMultimapNmax = 20 
parameter: alignSJoverhangMin = 8 
parameter: alignSJDBoverhangMin = 1 
parameter: outFilterMismatchNmax = 999 
parameter: outFilterMismatchNoverLmax = 0.1 #larger than Yang's group (outFilterMismatchNoverReadLmax 0.04)
parameter: alignIntronMin = 20 
parameter: alignIntronMax = 1000000 
parameter: alignMatesGapMax = 1000000 
parameter: outFilterType =  "BySJout" 
parameter: outFilterScoreMinOverLread = 0.33 
parameter: outFilterMatchNminOverLread = 0.33 
parameter: limitSjdbInsertNsj = 1200000 
parameter: outSAMstrandField = "intronMotif" 
parameter: outFilterIntronMotifs = "None" 
parameter: alignSoftClipAtReferenceEnds = "Yes" 
parameter: quantMode = ["TranscriptomeSAM", "GeneCounts"]
parameter: outSAMattrRGline = ["ID:rg1", "SM:sm1"]
parameter: outSAMattributes = ["NH", "HI", "AS", "nM", "NM", "ch"] #(Yang's group: outSAMattributes NH HI AS nM XS vW). vW is added if varVCFfile is set
parameter: chimSegmentMin = 15 
parameter: chimJunctionOverhangMin = 15 
parameter: chimOutType = ["Junctions", "WithinBAM", "SoftClip"]
parameter: chimMainSegmentMultNmax = 1 
parameter: sjdbOverhang = 100
parameter: mapping_quality = 255
if int(mem.replace("G","")) <  40:
    print("Insufficent memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
if varVCFfile:
    if not path(varVCFfile).is_file():
        raise FileNotFoundError(f"Cannot find varVCFfile ``{varVCFfile}``")
    print("Adding wasp filter in STAR alignment")
# This option is commented out because it would force the downstream analysis to use 40G, which significantly slows down the process.
input: raw_reads, group_by = is_paired_end + 1, group_with = {"sample_id","read_length"}
output: cord_bam = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out.bam',
        trans_bam = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out.bam'
if _read_length == 0:
    print("Using specified --sjdbOverhang as read length")
else:
    print("Using read length specified in the sample list")
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = star_mem, cores = numThreads
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    
    # Remove previous output files
    rm -rf $[cwd]/$[_sample_id].*.out.*.gz
    rm -rf $[cwd]/$[_sample_id]._STARpass1

    set -e

    # Record start timestamp for alignment
    align_start_time=$(date +%s)

    # Run STAR alignment
    STAR --runMode alignReads \
         --runThreadN $[numThreads] \
         --genomeDir $[STAR_index] \
         --readFilesIn $[_input:r] \
         --readFilesCommand $["cat" if uncompressed else "zcat"] \
         --outFileNamePrefix $[_output[0]:nnnn]. \
         --outSAMstrandField $[outSAMstrandField] \
         --twopassMode Basic \
         --outFilterMultimapNmax $[outFilterMultimapNmax] \
         --alignSJoverhangMin $[alignSJoverhangMin] \
         --alignSJDBoverhangMin $[alignSJDBoverhangMin] \
         --outFilterMismatchNmax $[outFilterMismatchNmax] \
         --outFilterMismatchNoverLmax $[outFilterMismatchNoverLmax] \
         --alignIntronMin $[alignIntronMin] \
         --alignIntronMax $[alignIntronMax] \
         --alignMatesGapMax $[alignMatesGapMax] \
         --outFilterType $[outFilterType] \
         --outFilterScoreMinOverLread $[outFilterScoreMinOverLread] \
         --outFilterMatchNminOverLread $[outFilterMatchNminOverLread] \
         --limitSjdbInsertNsj $[limitSjdbInsertNsj] \
         --outFilterIntronMotifs $[outFilterIntronMotifs] \
         --alignSoftClipAtReferenceEnds $[alignSoftClipAtReferenceEnds] \
         --quantMode $[" ".join(quantMode)] \
         --outSAMtype BAM Unsorted \
         --outSAMunmapped Within \
         --genomeLoad NoSharedMemory \
         --chimSegmentMin $[chimSegmentMin] \
         --chimJunctionOverhangMin $[chimJunctionOverhangMin] \
         --chimOutType $[" ".join(chimOutType)] \
         --chimMainSegmentMultNmax $[chimMainSegmentMultNmax] \
         --chimOutJunctionFormat 0 \
         --outSAMattributes $[" ".join(outSAMattributes)] $["vW" if varVCFfile else ""] \
         --outSAMattrRGline $[" ".join(outSAMattrRGline)] \
         --sjdbOverhang $[sjdbOverhang if _read_length == 0 else _read_length ] \
         --sjdbGTFfile $[gtf] $[("--varVCFfile %s --waspOutputMode SAMtag" % varVCFfile) if varVCFfile else ""]

    # Record end timestamp for alignment and calculate runtime
    align_end_time=$(date +%s)
    align_runtime=$((align_end_time - align_start_time))

    # Remove temporary files
    rm -r $[_output[0]:nnnn]._STARgenome
    $[f'rm -rf {_output[0]:nnnn}._STARtmp' if is_paired_end else ""]

    # Record start timestamp for sorting
    sort_start_time=$(date +%s)

    # Sort the aligned BAM file
    samtools sort --threads $[numThreads] -o $[_output[0]:nnnn].Aligned.sortedByCoord.out.bam $[_output[0]:nnnn].Aligned.out.bam
    rm $[_output[0]:nnnn].Aligned.out.bam

    # Record end timestamp for sorting and calculate runtime
    sort_end_time=$(date +%s)
    sort_runtime=$((sort_end_time - sort_start_time))

    # Index the final BAM file
    samtools index $[_output[0]]

    # Print runtime information
    echo "STAR alignment started at: $(date -d @$align_start_time)"
    echo "STAR alignment ended at: $(date -d @$align_end_time)"
    echo "STAR alignment runtime: $align_runtime seconds"
    echo ""
    echo "Sorting started at: $(date -d @$sort_start_time)"
    echo "Sorting ended at: $(date -d @$sort_end_time)"
    echo "Sorting runtime: $sort_runtime seconds"
[strand_detected_1: shared="step_strand_detected"]
input: output_from("STAR_align_1")["trans_bam"], group_with = "sample_id"
import pandas as pd
ReadsPerGene = pd.read_csv(f'{cwd}/{_sample_id}.ReadsPerGene.out.tab' , sep = "\t" , header = None )
ReadsPerGene_list = ReadsPerGene.loc[3::,].sum(axis = 0, numeric_only = True).values.tolist()
strand_percentage = [x/ReadsPerGene_list[0] for x in ReadsPerGene_list]
print(f'for sample {_sample_id}')
if strand_percentage[1] > 0.9:
    print(f'Counts for the 1st read strand aligned with RNA is {strand_percentage[1]}, > 90% of aligned count')
    print('Data is likely FR/fr-secondstrand')
    strand_detected = "fr"
elif  strand_percentage[2] > 0.9:
    print(f'Counts for the 2nd read strand aligned with RNA is {strand_percentage[2]}, > 90% of aligned count')
    print('Data is likely RF/fr-firststrand')
    strand_detected = "rf"
elif max( strand_percentage[1],  strand_percentage[2]) < 0.6:
    print(f'Both {strand_percentage[1]} and {strand_percentage[2]} are under 0.6, i.e. less than 60% of reads are explained by one direction')
    print('Data is likely unstranded')
    strand_detected = "unstranded"
else:
    strand_detected = 'strand_missing' 
    print(f'Data does not fall into a likely stranded (max percent explained {max(strand_percentage[1], strand_percentage[2])} > 0.9) or unstranded layout (max percent explained {max(strand_percentage[1], strand_percentage[2])} < 0.6); please check your data and manually specify the ``--strand`` option as ``fr``, ``rf`` or ``unstranded``')
[strand_detected_2: shared="strand"]
input: group_by = "all"
parameter: strand = ""
import pandas as pd
if not strand:
    if len(strand_inv) > 0:
        strand = strand_inv
        for i in range(0,len(strand)):
            if strand[i] == "strand_missing":
                strand[i] = step_strand_detected[i]
        print(f'Using strand specified in the input sample list {strand}, replacing strand_missing with detected strand')
    else:
        warn_if(not all(x == step_strand_detected[0] for x in step_strand_detected), msg = "strands detected differ among samples; please check your protocol. We will use the detected strand for each sample")
        strand = step_strand_detected
        print(f'Using detected strand for each sample {strand}')
else:
    stop_if(strand not in ["fr", "rf", "unstranded"], msg = "``--strand`` option should be ``fr``, ``rf`` or ``unstranded``")
    print(f'Using ``--strand`` overwrite option for all the samples: {strand}') 
    strand = [strand] * len(step_strand_detected)

Step 4: Mark duplicates reads & QC through Picard#

This step is the first QC step after STAR alignment. It performs QC to collect multiple metrics on the RNA-seq data using Picard. It then also generates a new .bam file with duplicates marked using the SAM flag 0x0400, which corresponds to a decimal value of 1024.

Step Inputs:#

  • STAR_bam: path to the output in Step 2.

  • reference_fasta: The fasta reference file used to generate the STAR index. It is critical that this fasta is exactly the same as the one used to generate the STAR index, else this error will occur: statfungen/xqtl-protocol#357

  • RefFlat file

This file is needed for the picard CollectRnaSeqMetrics module, which in turn

produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES).

The refFlat file can be generated by the reference_data module, RefFlat_generation step.

Step Outputs:#

  • A collection of metrics file for each of the samples

  • A new .bam file with duplicates marked using the SAM flag 0x0400, which corresponds to a decimal value of 1024
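The duplicate flag can be inspected with pysam, for example (an illustrative snippet; the BAM file name is a placeholder):

import pysam

# Count reads carrying the 0x0400 (decimal 1024) duplicate flag set by MarkDuplicates.
bam = pysam.AlignmentFile("sample.Aligned.sortedByCoord.out.md.bam", "rb")
n_dup = sum(1 for read in bam if read.flag & 0x0400)  # equivalently: read.is_duplicate
print(f"{n_dup} reads are marked as duplicates")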

[picard_qc, STAR_align_2]
# Reference gene model
parameter: gtf = path
depends: sos_variable('strand')
# Path to flat reference file, for computing QC metric
parameter: ref_flat = path
# The fasta reference file used to generate star index
parameter: reference_fasta = path
# For the patterned flowcell models (HiSeq X), change to 2500
parameter: optical_distance = 100
parameter: zap_raw_bam = False
picard_strand_dict = {"rf":"SECOND_READ_TRANSCRIPTION_STRAND","fr": "FIRST_READ_TRANSCRIPTION_STRAND","unstranded":"NONE" }
input: output_from('filter_reads'), group_by = 2, group_with = {"sample_id","strand"}
output: picard_metrics = f'{cwd}/{_sample_id}.alignment_summary_metrics',
        picard_rna_metrics = f'{cwd}/{_sample_id}.rna_metrics',
        md_bam = f'{_input[0]:n}.md.bam',
        md_metrics = f'{_input[0]:n}.md.metrics',
        bigwig = f'{_input[0]:n}.md.bw',
        output_summary = f'{cwd}/{_sample_id}.bam_file_meta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    # Record start timestamp for CollectMultipleMetrics
    start_time=$(date +%s)
    echo "CollectMultipleMetrics started at: $(date)"
    set -e
    picard -Xmx${java_mem} CollectMultipleMetrics \
        -REFERENCE_SEQUENCE ${reference_fasta} \
        -PROGRAM CollectAlignmentSummaryMetrics \
        -PROGRAM CollectInsertSizeMetrics \
        -PROGRAM QualityScoreDistribution \
        -PROGRAM MeanQualityByCycle \
        -PROGRAM CollectBaseDistributionByCycle \
        -PROGRAM CollectGcBiasMetrics \
        -VALIDATION_STRINGENCY STRICT \
        -INPUT  ${_input["cord_bam_wasp_qc"]} \
        -OUTPUT  ${_output[0]:n}
    
    # Record end timestamp and calculate runtime for CollectMultipleMetrics
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "CollectMultipleMetrics ended at: $(date)"
    echo "CollectMultipleMetrics runtime: $runtime seconds"

bash: container=container, expand= "${ }", stderr = f'{_output[-1]:n}.stderr', stdout = f'{_output[-1]:n}.stdout', entrypoint=entrypoint
    start_time=$(date +%s)
    echo "MarkDuplicates started at: $(date)"
    set -e
    picard -Xmx${java_mem}  MarkDuplicates \
        -I ${_input["cord_bam_wasp_qc"]}  \
        -O ${_output[2]} \
        -PROGRAM_RECORD_ID null \
        -M ${_output[3]} \
        -TMP_DIR ${cwd} \
        -MAX_RECORDS_IN_RAM 500000 -SORTING_COLLECTION_SIZE_RATIO 0.25 \
        -ASSUME_SORT_ORDER coordinate \
        -TAGGING_POLICY DontTag \
        -OPTICAL_DUPLICATE_PIXEL_DISTANCE ${optical_distance} \
        -CREATE_INDEX true \
        -CREATE_MD5_FILE true \
        -VALIDATION_STRINGENCY STRICT \
        -REMOVE_SEQUENCING_DUPLICATES false \
        -REMOVE_DUPLICATES false
    samtools index ${_output[2]}
    # Record end timestamp and calculate runtime for MarkDuplicates
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "MarkDuplicates ended at: $(date)"
    echo "MarkDuplicates runtime: $runtime seconds"

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
    set -e
    # Get only lines with rRNA and transcript_id from the GTF file
    cat ${gtf} | grep rRNA | grep transcript_id > ${cwd}/$(basename ${gtf}).tmp.${_input["cord_bam_wasp_qc"]:bnnnn}  # To avoid a data race, this temporary file is written under the output path
    
    # Extract header from the BAM file
    samtools view -H ${_input["cord_bam_wasp_qc"]}  > ${_input["cord_bam_wasp_qc"]}.RI

python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
        import os
        import pandas as pd
        from collections import defaultdict
        chrom = []
        start = []
        end = []
        strand = []
        tag = []
        filename = os.path.basename("${gtf}.tmp.${_input["cord_bam_wasp_qc"]:bnnnn}")
        annotation_gtf = os.path.join("${cwd}",filename) # read the temporary rRNA/transcript_id subset of the GTF written above

        with open(annotation_gtf, 'r') as gtf:
            for row in gtf:
                row = row.strip().split('\t')
                if row[0][0]=='#' or row[2]!="transcript": continue # skip header
                chrom.append(row[0])
                start.append(row[3])
                end.append(row[4])
                strand.append(row[6])
                attributes = defaultdict()
                for a in row[8].replace('"', '').split(';')[:-1]:
                    kv = a.strip().split(' ')
                    if kv[0]!='tag':
                        attributes[kv[0]] = kv[1]
                    else:
                        attributes.setdefault('tags', []).append(kv[1])
                tag.append(attributes)
        transcript_id = [x["transcript_id"] for x in tag]
        RI = pd.DataFrame(data={'chr':chrom, 'start':start, 'end':end, 'strand':strand,'transcript_id' : transcript_id })
        RI.to_csv("${_input["cord_bam_wasp_qc"]}.RI", index = 0, header = 0, mode = "a",sep = "\t" )

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.rna_metrics.stderr', stdout = f'{_output[0]:n}.rna_metrics.stdout', entrypoint=entrypoint
    # Record start timestamp for CollectRnaSeqMetrics
    start_time=$(date +%s)
    echo "CollectRnaSeqMetrics started at: $(date)"
    set -e
    picard -Xmx${java_mem} CollectRnaSeqMetrics \
        -REF_FLAT ${ref_flat} \
        -RIBOSOMAL_INTERVALS ${_input["cord_bam_wasp_qc"]}.RI \
        -STRAND_SPECIFICITY ${picard_strand_dict[_strand]} \
        -CHART_OUTPUT ${_output[0]:n}.rna_metrics.pdf \
        -VALIDATION_STRINGENCY STRICT \
        -INPUT ${_input["cord_bam_wasp_qc"]}  \
        -OUTPUT  ${_output[0]:n}.rna_metrics
    
    rm ${cwd}/$(basename ${gtf}).tmp.${_input["cord_bam_wasp_qc"]:bnnnn} # remove the temporary rRNA/transcript_id subset of the GTF
    
    # Record end timestamp and calculate runtime for CollectRnaSeqMetrics
    end_time=$(date +%s)
    runtime=$((end_time - start_time))
    echo "CollectRnaSeqMetrics ended at: $(date)"
    echo "CollectRnaSeqMetrics runtime: $runtime seconds"

bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.bw.stderr', stdout = f'{_output[0]:n}.bw.stdout', entrypoint=entrypoint
    generate_bigwig() {
        # Check if enough arguments are provided
        if [[ $# -ne 1 ]]; then
            echo "Usage: generate_bigwig <file.bam>"
            return 1
        fi

        local input_file="$1"
        
        # Derive file.bw from file.bam
        local bigwig_file="${input_file%.*}.bw"

        # Execute the commands
        samtools index "$input_file"
        bamCoverage -b "$input_file" -o "$bigwig_file"
    }

    export -f generate_bigwig  # To make it available in subshells, if needed.
    generate_bigwig $[_output["md_bam"]]

import pandas as pd
out = pd.DataFrame({
    "sample_id": _sample_id,
    "strand": _strand,
    "coord_bam_list": f'{_output["md_bam"]:b}',
    "BW_list": f'{_output["bigwig"]:b}',
    "SJ_list": f'{_output["md_bam"]:bnnnnn}.SJ.out.tab',
    "trans_bam_list": f'{_output["md_bam"]:bnnnn}.toTranscriptome.out{"_wasp_qc" if varVCFfile else ""}.bam'},
    index=[0])
out.to_csv(_output["output_summary"], sep="\t", index=False)

if zap_raw_bam:
    _input["cord_bam_wasp_qc"].zap()
[STAR_align_3]
input: group_by = "all"
depends: sos_variable("strand")
output: f'{cwd}/{sample_list:bn}.bam_file_list'
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = 1
python: container=container, expand= "${ }", entrypoint=entrypoint

    import pandas as pd

    coord_bam_list = [${_input:br,}][2::6]
    sorted_bigwig_list = [${_input:br,}][4::6]
    SJ_list = [f'{x}.SJ.out.tab' for x in [${_input:bnnnnnr,}][2::6]]
    trans_bam_list = [f'{x}.toTranscriptome.out{"_wasp_qc" if "${varVCFfile}" else ""}.bam' for x in [${_input:bnnnnr,}][2::6]]

    out = pd.DataFrame({
        "sample_id": ${sample_id},
        "strand": ${strand},
        "coord_bam_list": coord_bam_list,
        "BW_list": sorted_bigwig_list,
        "SJ_list": SJ_list,
        "trans_bam_list": trans_bam_list
    })

    out.to_csv("${_output}", sep="\t", index=False)

Step 5: Post-alignment QC through RNA-SeQC#

Documentation : RNA-SeQC and Script in docker

This step is the second QC step after STAR alignment. It will perform RNA-seq quantification as well.

Step Inputs#

  • bam_list: path to the output of STAR_align, which lists the strand and BAM files used for each sample.

  • gtf: reference genome .gtf file. This gtf file needs to have the same chr name format as the index used to generate the bam file, and must be a collapsed gene gtf.

Step Outputs#

The following output files are generated in the output directory you provide:

  • {sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.

  • {sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.

  • {sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons

  • {sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file. Note: this file is renamed to .gene_rpkm.gct if the --rpkm flag is present. (A pandas loading sketch follows this list.)

  • {sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided

  • {sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.
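For reference, each GCT file carries two header lines (a version tag and the matrix dimensions) ahead of the tab-delimited table, so it can be loaded with pandas by skipping them, just as the merge step below does (a sketch; the file name is a placeholder):

import pandas as pd

# Skip the GCT version and dimension lines, then index genes by the Name column.
gct = pd.read_csv("sample1.rnaseqc.gene_tpm.gct", sep="\t", skiprows=2, index_col="Name")
print(gct.drop(columns="Description").head())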

[rnaseqc_call_1]
import os
import pandas as pd
parameter: bam_list = path
# Reference gene model
parameter: gtf = path
sample_inv = pd.read_csv(bam_list,sep = "\t")
parameter: detection_threshold = 5
parameter: mapping_quality = 255
## Extract strand information if the user has specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list ]), msg = "strand columns should only include ``fr``, ``rf`` or ``unstranded``, please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file names for coord_bam data
coord_bam_list_inv = sample_inv.coord_bam_list.values.tolist()
coord_bam_list = [f'{cwd}/{x}' for x in coord_bam_list_inv ]
input:  coord_bam_list, group_by = 1, group_with = {"sample_id","strand_list"}
output: f'{cwd}/{_sample_id}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.gene_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.exon_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.metrics.tsv'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    cd ${cwd} && \
    rnaseqc \
        ${gtf:a} \
        ${_input:a} \
        ${_output[0]:d}  \
        ${("--stranded " + _strand_list) if _strand_list != "unstranded" else ""} \
        ${f'--detection-threshold {detection_threshold}'} ${f'--mapping-quality {mapping_quality}'} ${ '' if is_paired_end else '-u'}

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    mv ${_output[0]:d}/${_input:b}.gene_tpm.gct ${_output[0]:n}
    mv ${_output[0]:d}/${_input:b}.gene_reads.gct ${_output[1]:n}
    mv ${_output[0]:d}/${_input:b}.exon_reads.gct ${_output[2]:n}
    mv ${_output[0]:d}/${_input:b}.metrics.tsv ${_output[3]}
    gzip ${_output[0]:n} ${_output[1]:n} ${_output[2]:n}

The RNA-SeQC results are merged in the following step:

[rnaseqc_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.gene_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.exon_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.metrics.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    import pandas as pd
    import os
    def make_gct(gct_path):
        # sample_name
        sample_name = ".".join(os.path.basename(gct_path).split(".")[:-4])
        # read_input
        pre_gct = pd.read_csv(gct_path,sep = "\t",
                              skiprows= 2,index_col="Name").drop("Description",axis = 1)
        pre_gct.index.name = "gene_ID"
        pre_gct.columns = [sample_name]
        return(pre_gct)

    def merge_gct(gct_path_list):
        gct = pd.DataFrame()
        for gct_path in gct_path_list:
            #check for duplicated indices and remove them.
            gct_col = make_gct(gct_path)
            gct = gct.merge(gct_col,right_index=True,left_index = True,how = "outer")
        return gct

    input_list = [${_input:r,}]
    tpm_list = input_list[0::4]
    gc_list = input_list[1::4]
    ec_list = input_list[2::4]
    gct_path_list_list = [tpm_list,gc_list,ec_list]
    output_path = [${_output:r,}][0:3]
    for i in range(0,len(output_path)):
        output = merge_gct(gct_path_list_list[i])
        output.to_csv(output_path[i], sep = "\t")
    metrics_list = input_list[3::4]
    with open("${cwd}/${bam_list:bn}.rnaseqc.metrics_output_list", "w") as f:
        f.write('\n'.join(metrics_list))

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    aggregate_rnaseqc_metrics  ${_output[3]:n}_output_list ${_output[3]:nn}

Step 6: Quantify expression through RSEM#

Documentation : RSEM and Script in docker

This step generates the expression matrices from the STAR output. Estimates of gene and isoform expression from the RNA-seq data are produced.

Step Input#

  • bam_list: path to the output of STAR_align, which lists the strand and BAM files used for each sample.

  • transcript-level BAM file: path to the output of Step 3.

  • RSEM_index: path to RSEM index

Step Outputs#

Please see the output section of https://deweylab.github.io/RSEM/rsem-calculate-expression.html
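As a quick orientation to that format, the per-sample results tables are tab-delimited and can be inspected with pandas (a sketch; the file name is a placeholder):

import pandas as pd

# RSEM *.genes.results / *.isoforms.results report per-feature columns such as
# expected_count, TPM and FPKM; the merge step below aggregates these across samples.
res = pd.read_csv("sample1.rsem.genes.results", sep="\t", index_col="gene_id")
print(res[["expected_count", "TPM", "FPKM"]].head())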

[rsem_call_1]
parameter: bam_list = path
parameter: RSEM_index = path
parameter: max_frag_len = 1000
parameter: estimate_rspd = True

sample_inv = pd.read_csv(bam_list,sep = "\t")

## Extract strand information if the user has specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list ]), msg = "strand columns should only include ``fr``, ``rf`` or ``unstranded``, please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file names for trans_bam_list data
trans_bam_list_inv = sample_inv.trans_bam_list.values.tolist()
trans_bam_list = [f'{cwd}/{x}' for x in trans_bam_list_inv]
input: trans_bam_list, group_by = 1, group_with = {"sample_id", "strand_list"} 
output: f'{cwd}/{_sample_id}.rsem.isoforms.results', f'{cwd}/{_sample_id}.rsem.genes.results',f'{cwd}/{_sample_id}.rsem.stat/{_sample_id}.rsem.cnt'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    run_RSEM ${RSEM_index:a} ${_input:a} ${_sample_id} \
        -o ${_output[0]:d} \
        --max_frag_len ${max_frag_len} \
        --estimate_rspd ${'true' if estimate_rspd else 'false'} \
        --paired_end ${"true" if is_paired_end else "false"} \
        --is_stranded ${"true" if _strand_list != "unstranded" else "false"} \
        --threads ${numThreads}

The RSEM results are merged in the following step; seven files (four for the isoform-level output columns and three for the gene-level output columns) will be generated.

[rsem_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rsem_transcripts_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_isopct.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem.aggregated_quality.metrics.tsv'

task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
python: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    input_list = [${_input:r,}]
    with open('${cwd}/${bam_list:bn}.rsem.isoforms_output_list', "w") as f:
        f.write('\n'.join(input_list[0::3]))
    with open('${cwd}/${bam_list:bn}.rsem.genes_output_list', "w") as f:
        f.write('\n'.join(input_list[1::3]))

bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
         aggregate_rsem_results ${cwd}/${bam_list:bn}.rsem.isoforms_output_list {expected_count,TPM,FPKM,IsoPct} ${_output[0]:nnn} -o ${cwd}
         aggregate_rsem_results ${cwd}/${bam_list:bn}.rsem.genes_output_list {expected_count,TPM,FPKM} ${_output[1]:nnn} -o ${cwd}
# -o specifies the output directory (must be writable); the default is the cwd of the sos command (e.g. /home/username/data), which may be read-only

R:  container=container, expand= "${ }", stderr = f'{_output[-1]:n}.stderr', stdout = f'{_output[-1]:n}.stdout', entrypoint=entrypoint
     readRSEM.cnt <- function (source) {
            # RSEM .cnt files gives statistics about the (transcriptome) alignment passed to RSEM:
            # Row 1: N0 (# unalignable reads);
            #        N1 (# alignable reads);
            #        N2 (# filtered reads due to too many alignments);
            #        N_tot (N0+N1+N2)
            # Row 2: nUnique (# reads aligned uniquely to a gene);
            #        nMulti (# reads aligned to multiple genes);
            #        nUncertain (# reads aligned to multiple locations in the given reference sequences, which include isoform-level multi-mapping reads)
            # Row 3: nHits (# total alignments);
            #        read_type (0: single-end read, no quality; 1: single-end read, with quality score; 2: paired-end read, no quality score; 3: paired-end read, with quality score)
            # Source: https://groups.google.com/forum/#!topic/rsem-users/usmPKgsC5LU
            # Note: N1 = nUnique + nMulti

            stopifnot(file.exists(source[1]))
            isDir <- length(source) == 1 && isTRUE(file.info(source)$isdir)
            if (isDir) {
                files <- system(paste("find", source, "-name \"*.rsem.cnt\""), intern=TRUE)
                stopifnot(length(files) > 0)
            } else {
                files <- source
            }
            samples <- gsub(".rsem.cnt", "", basename(files), fixed=TRUE)
            metrics <- list()
            for (i in 1:length(files)) {
                m <- read.table(files[i], header=FALSE, sep=" ", comment.char="#", stringsAsFactors=FALSE, nrows=3, fill=TRUE)
                metrics[[i]] <- data.frame(Sample=samples[i],
                File=files[i],
                TotalReads=m[1, 4],
                AlignedReads=m[1, 2],
                UniquelyAlignedReads=m[2, 1],
                stringsAsFactors=FALSE)
            }
            metrics <- do.call(rbind, metrics)
            row.names(metrics) <- metrics$Sample

            return(metrics)
        }
        sourceRSEM = c(${_input:r,})
        sourceRSEM = sourceRSEM[seq(3,length(sourceRSEM),3)]
        metrics.RSEM = readRSEM.cnt(sourceRSEM)
        write.table(metrics.RSEM, file="${_output[-1]}",col.names=TRUE, row.names=FALSE, quote=FALSE)
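
For orientation, here are the first three rows of a hypothetical {sample}.rsem.cnt file, following the layout documented in readRSEM.cnt above (values are illustrative; note that N1 = nUnique + nMulti, and TotalReads is read from the fourth field of row 1):

0 41652 0 41652
40324 1328 83
43119 3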

Step 7: summarize with MultiQC#

MultiQC

MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn’t run other tools for you - it’s designed to be placed at the end of analysis pipelines or to be run manually when you’ve finished running your tools. When you launch MultiQC, it recursively searches through any provided file paths and finds files that it recognises. It parses relevant information from these and generates a single stand-alone HTML report file. It also saves a directory of data files with all parsed data for further downstream use.

MultiQC will automatically generate a QC report for anything it recognises within the given directory, so providing the directory that contains all of the pipeline output will suffice.

The output of MultiQC is a multi-module report, with each module corresponding to the quality report of one of the analysis steps performed above.

[rsem_call_3,rnaseqc_call_3]
output: f'{_input[0]:nnn}.multiqc_report.html'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
report: output = f"{_output:n}.multiqc_config.yml"
  extra_fn_clean_exts:
      - '_rsem'
  fn_ignore_dirs:
      - '*_STARpass1'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    multiqc ${_input:d} -v -n ${_output:b} -o ${_output:d} -c ${_output:n}.multiqc_config.yml

[rsem_call_4, rnaseqc_call_4]
# Aggregate the Picard alignment, RNA and duplication metrics across samples into a single table
output: f'{_input[0]:nn}.picard.aggregated_quality.metrics.tsv' # note: this is written to a deeply nested directory under {cwd}
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
R: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', entrypoint=entrypoint
    ## Define Function
    readPicard.alignment_summary_metrics <- function (source) {
      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name \"*.alignment_summary_metrics\""), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(".alignment_summary_metrics", "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(".alignment_summary_metrics", "", basename(files), fixed=TRUE)
      }
    
      metrics <- list()
      for (i in 1:length(files)) {
        # nrows: 2 category rows (FIRST_OF_PAIR, SECOND_OF_PAIR) for paired-end data, 1 (UNPAIRED) for single-end
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=${2 if is_paired_end else 1})
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PF_READS=sum(m$PF_READS),
                                   PF_READS_ALIGNED=sum(m$PF_READS_ALIGNED),
                                   PCT_PF_READS_ALIGNED=sum(m$PF_READS_ALIGNED)/sum(m$PF_READS),
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }

    readPicard.rna_metrics <- function(source) {

      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name \"*.rna_metrics\""), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(".rna_metrics", "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(".rna_metrics", "", basename(files), fixed=TRUE)
      }

      metrics <- list()
      for (i in 1:length(files)) {
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=1)
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PCT_RIBOSOMAL_BASES=m$PCT_RIBOSOMAL_BASES,
                                   PCT_CODING_BASES=m$PCT_CODING_BASES,
                                   PCT_UTR_BASES=m$PCT_UTR_BASES,
                                   PCT_INTRONIC_BASES=m$PCT_INTRONIC_BASES,
                                   PCT_INTERGENIC_BASES=m$PCT_INTERGENIC_BASES,
                                   PCT_MRNA_BASES=m$PCT_MRNA_BASES,
                                   PCT_USABLE_BASES=m$PCT_USABLE_BASES,
                                   MEDIAN_CV_COVERAGE=m$MEDIAN_CV_COVERAGE,
                                   MEDIAN_5PRIME_BIAS=m$MEDIAN_5PRIME_BIAS,
                                   MEDIAN_3PRIME_BIAS=m$MEDIAN_3PRIME_BIAS,
                                   MEDIAN_5PRIME_TO_3PRIME_BIAS=m$MEDIAN_5PRIME_TO_3PRIME_BIAS,
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }


    readPicard.duplicate_metrics <- function(source) {

      stopifnot(length(source) == 1)
      stopifnot(file.exists(source))
      pattern_suffix <- ifelse("${varVCFfile}"!="","_wasp_qc","")
      pattern <- paste0('*.Aligned.sortedByCoord.out',pattern_suffix,'.md.metrics')
      substitute <- paste0('.Aligned.sortedByCoord.out',pattern_suffix,'.md.metrics')
      isDir <- file.info(source)$isdir
      if (isDir) {
        files <- system(paste("find -L", source, "-name",pattern), intern=TRUE)
        stopifnot(length(files) > 0)
        samples <- gsub(substitute, "", basename(files), fixed=TRUE)
      } else {
        files <- source
        samples <- gsub(substitute, "", basename(files), fixed=TRUE)
      }

      metrics <- list()
      for (i in 1:length(files)) {
        m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=1)
        metrics[[i]] <- data.frame(Sample=samples[i],
                                   File=files[i],
                                   PERCENT_DUPLICATION=m$PERCENT_DUPLICATION,
                                   ESTIMATED_LIBRARY_SIZE=m$ESTIMATED_LIBRARY_SIZE,
                                   stringsAsFactors=FALSE)
      }
      metrics <- do.call(rbind, metrics)
      row.names(metrics) <- metrics$Sample
    
      return(metrics)
    }


    readPicard <- function(source) {
      metrics.aln <- readPicard.alignment_summary_metrics(source)
      metrics.rna <- readPicard.rna_metrics(source)
      metrics.dup <- readPicard.duplicate_metrics(source)

      stopifnot(all(row.names(metrics.aln) %in% row.names(metrics.rna)) &
                all(row.names(metrics.rna) %in% row.names(metrics.dup)) &
                all(row.names(metrics.dup) %in% row.names(metrics.aln)))

      metrics.aln$File <- NULL
      metrics.rna$File <- NULL
      metrics.dup$File <- NULL
      metrics.rna$Sample <- NULL
      metrics.dup$Sample <- NULL

      metrics <- cbind(metrics.aln, metrics.rna[row.names(metrics.aln), ])
      metrics <- cbind(metrics, metrics.dup[row.names(metrics.aln), ])

      return(metrics)
    }

    ## Execution  
    sourcePicard = ${_input[-1]:dr}

    Picard_qualityMetrics <- readPicard(sourcePicard)
    write.table(Picard_qualityMetrics, file="${_output}",col.names=TRUE, row.names=FALSE, quote=FALSE)


# Note: a previous error in [STAR_align] during an interactive session can leave a hidden .checkpoint directory behind, which may cause the three Picard metric tables (e.g. metrics.aln) to have inconsistent numbers of rows; remove it before rerunning.
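
For orientation, a hypothetical two-sample slice of the aggregated Picard table (selected columns from the data frames assembled above; the values are illustrative only):

Sample        PF_READS  PF_READS_ALIGNED  PCT_PF_READS_ALIGNED  PCT_MRNA_BASES  PERCENT_DUPLICATION  ESTIMATED_LIBRARY_SIZE
test_sample1  41652     40324             0.968                 0.812           0.132                18234510
test_sample2  39870     38102             0.956                 0.794           0.147                16510233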

Step 8: Filter Reads with WASP#

This optional step in the STAR alignment process offers filtering based on WASP correction and unique-read identification. WASP correction mitigates reference-allele bias in RNA-seq data by removing reads that carry failing WASP flags. Unique-read filtering retains only uniquely mapped reads, identified by a mapping quality (MAPQ) of 255, per GTEx Consortium standards. These filters can be applied individually or in combination to improve alignment quality and prepare BAM files for downstream analyses.
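
With STAR's --waspOutputMode SAMtag, alignments overlapping variants carry a vW tag: vW:i:1 marks alignments that pass WASP filtering, while values 2 through 7 mark the failure modes that are removed below. A quick sanity check on any tagged BAM (file name is a placeholder):

# Count WASP-passing vs. WASP-failing alignment records in a STAR BAM with vW tags.
samtools view sample.Aligned.sortedByCoord.out.bam | grep -c "vW:i:1"       # passing
samtools view sample.Aligned.sortedByCoord.out.bam | grep -c "vW:i:[2-7]"   # failing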

Step Inputs:#

  • No input needs to be specified manually; all files are taken from the output of STAR_align_1.

Step Outputs:#

  • A collection of filtered (or unfiltered) .bam files
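
A hedged example invocation (paths are placeholders; since the unique and wasp parameters are only tested for non-emptiness in the bash function below, any non-empty string such as true enables the corresponding filter). Note that this step pulls its input from output_from("STAR_align_1"), so STAR_align_1 must have been run first:

sos run pipeline/RNA_calling.ipynb filter_reads \
    --cwd output/rnaseq/bam \
    --wasp true \
    --unique true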

[filter_reads]
# Set to any non-empty string (e.g. "true") to keep only uniquely mapped reads
parameter: unique=""
# Set to any non-empty string (e.g. "true") to remove reads failing WASP filtering
parameter: wasp=""
parameter: mapping_quality = 255
depends: sos_variable('strand')
input: output_from("STAR_align_1"), group_by = 2, group_with = {"sample_id","strand"}
output: cord_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out{"_wasp" if wasp else "_nowasp"}{"_qc" if unique else "_noqc"}.bam',
        trans_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out{"_wasp" if wasp else "_nowasp"}{"_qc" if unique else "_noqc"}.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint

    # WASP-based QC filtering function
    wasp_filter() {
        # Check if an argument is provided
        if [[ -z "$1" ]]; then
            echo "Usage: wasp_filter <file.something.ext>"
            return 1
        fi
        local input_file="$1"
        local base_name="${input_file%.*}"  # strip the last extension (.bam)
        local derived_file="${base_name%.*}.int.bam"  # strip the second-to-last extension as well and append .int.bam

        if [ $[unique] ]; then # note: [ str ] is true iff str is non-empty; mind the spaces inside the brackets
            # wasp_filtering & unique_filtering
            if [ $[wasp] ]; then
                echo "Applying unique_WASP filtering"
                samtools view -h -q $[mapping_quality] "$input_file" | grep -v "vW:i:[2-7]" | samtools view -b > "$derived_file"
            # unique_filtering
            else
                echo "Applying unique filtering"
                samtools view -h -q $[mapping_quality] "$input_file" | samtools view -b > "$derived_file" # remember to convert SAM files back to BAM files
            fi
        else
            # only wasp_filtering: Potential Allele-Specific Alignment Bias
            if [ $[wasp] ]; then
                echo "Applying WASP filtering"
                samtools view -h "$input_file" | grep -v "vW:i:[2-7]" | samtools view -b > "$derived_file"
            # no filter
            else
                echo "Applying no filtering"
                cp "$input_file" "$derived_file"
            fi
        fi
        return 0
    }

    # Do not overwrite the original BAM with its filtered version in place: the file would be read and written simultaneously, which corrupts it (observed as "Exec format error"). Write to an intermediate .int.bam and rename it afterwards.

    export -f wasp_filter  # Make it available in subshells, if needed

    # Perform WASP-based QC filtering
    wasp_filter $[_input[0]]
    wasp_filter $[_input[1]] 
    mv $[_output[0]:nnnn].Aligned.sortedByCoord.int.bam  $[_output[0]]
    mv $[_output[1]:nnnn].Aligned.toTranscriptome.int.bam  $[_output[1]]