Quantifying expression from RNA-seq data

Quantifying expression from RNA-seq data#

Description#

Our pipeline follows the GTEx pipeline from the GTEx/TOPMed project for the quantification of expression from RNA-seq data. Either paired end or single end fastq.gz files may be used as the initial input. Different read strandedness options are possible, including rf, fr, unstranded or strand_missing, based on library types from Signal et al [cf. Signal et al (2022)]. Strand detection steps are included in this pipeline to automaticaly detect the strand of the input fastq file by levradging the gene count table ouptut from STAR. Read length data is required and is set to a default value of 100 for reads of zero length.

We recommend using fastqc to quality control reads, followed by the removal of adaptors with fastp if necessary. After quality control has been conducted, STAR may be used to align reads to the reference genome. We combine this mapping step with an additionaly quality control step using Picard to mark duplicate reads and collect other metrics.

Gene-level RNA expression is called with RNA-SeQC v2.4.2 using a reference gtf that has been collapsed to contain genes instead of gene transcripts [cf. DeLuca et al., Bioinformatics, 2012]. Reads are only included if they are uniquely mapped (mapping quality of 255 for STAR BAMs), if they are aligned in proper pairs, if the read alignment distance was less than or equal to six (i.e. alignments must not contain more than six non-reference bases), and if they are fully contained within exon boundaries. Reads overlapping introns are not included. Exon level read counts are also called by RNA-SeQC. If a read overlaps multiple exons, then a fractional value equal to the portion of the read contained within the exon is allotted. We call transcript-level RNA expression using RSEM v1.3.0 with a transcript reference gtf.

Timing: ~10-30 min on typical compute infrastructure.

Input#

A tab delimited file containing sample read information with one line per sample. This file should include a header for the following columns:

ID - sample ID (required)
fq1 - name of the fastq file (required)
fq2 - name of the second fastq file of the sample if using paired end data (required only if using paired end data)
strand - the strandedness of the sample (optional)
Options include rf, fr, unstranded or strand_missing, based on library types from Signal et al [cf. Signal et al. 2022]. Strand detection steps are included in this pipeline to automaticaly detect the strand of the input fastq file by leveraging the gene count table ouptut from STAR.
read_length - the read length of the sample’s reads (optional)
If this is unknown, enter 0 and the pipeline will use a default read length of 100. This default can be changed using the --sjdbOverhang parameter. If none of the samples have read length information then please leave out the read_length column so that the pipeline uses the default length for each sample.

Output#

The pipeline runs per sample under --cwd and produces, in addition to the intermediate alignment files described in the steps below:

{sample_id}.Aligned.sortedByCoord.out.bam — coordinate-sorted STAR alignments (a WASP-filtered ..._wasp_qc...bam variant is also produced for allele-specific work).
{sample_id}.rsem.genes.results and {sample_id}.rsem.isoforms.results — per-sample RSEM gene- and transcript-level expression estimates (counts and TPM); .rsem.stat/ holds the RSEM run statistics.
{sample_id}.rnaseqc.gene_tpm.gct.gz — RNA-SeQC gene-level TPM matrix used for QC.
{sample_id}.multiqc_report.html and *.alignment_summary_metrics — MultiQC and Picard alignment-QC summaries.
{bam_list}.rsem_transcripts_expected_count.txt.gz and the merged *.rnaseqc.gene_tpm.gct.gz — cohort-level expression matrices that merge all samples, ready for downstream normalization.

Input Files#

The pipeline is driven by a tab-delimited sample list and a directory of paired-end FASTQ files, plus shared reference data.

File	Description
`input/rnaseq/protocol_example.rnaseq.fastq.list.txt`	Sample manifest (`ID`, `fq1`, `fq2`, `strand`, `read_length`); two demo samples `SAMPLE_001`, `SAMPLE_002`
`input/rnaseq/protocol_example.rnaseq.fastq/`	Paired-end FASTQ files referenced by `fq1`/`fq2`
`reference_data/STAR_Index/`	STAR genome index (chromosome-subset for the demo)
`reference_data/RSEM_Index/`	RSEM reference (matching transcript-level GTF)
`reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta`	Reference genome FASTA
`reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf`	Transcript-level annotation (STAR / RSEM)
`reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf`	Collapsed gene-level annotation (RNA-SeQC)
`reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat`	refFlat annotation (Picard)

Minimal Working Example#

The commands below run each stage of the pipeline on the two demo samples. They use the simplest alignment recipe (no WASP / no unique filtering), which is appropriate for quantifying expression for eQTL analysis. Run them in order; each stage consumes the output of the previous one.

Quality control with `fastqc`#

sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/

Trim adaptors with `fastp` (optional)#

Generates trimmed FASTQ files and a new sample list pointing at them.

sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/ \
    --STAR-index reference_data/STAR_Index \
    --gtf input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat 

Align reads with `STAR`#

Aligns reads and produces the per-run BAM manifest (<sample-list-stem>.bam_file_list) that the downstream RSEM step consumes. The recipe below adds no WASP/unique-filter tags (fastest; suitable for eQTL expression quantification). Other recipes are available by adding --wasp yes and/or --unique yes together with a --varVCFfile.

sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/ \
    --STAR-index input/reference_data/STAR_Index \
    --gtf input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --chimSegmentMin 0 \
    -J 2 --mem 16G --numThreads 2 

Gene-level expression with `rnaseqc`#

Uses the collapsed gene-level GTF. Point --bam_list at the manifest produced by STAR_align and keep --cwd consistent with where the BAMs live.

sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq/bam \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/ \
    --gtf input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --reference-fasta input/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --bam_list output/rnaseq/bam/protocol_example.rnaseq.fastq.list.bam_file_list 

Transcript-level expression with `RSEM`#

Uses the transcript-level GTF and the RSEM reference. As above, --bam_list is the STAR_align manifest and --cwd must match the BAM location (the manifest stores bare filenames that are resolved relative to --cwd).

sos run pipeline/RNA_calling.ipynb rsem_call \
    --cwd output/rnaseq/bam \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/ \
    --STAR-index reference_data/STAR_Index \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --bam_list output/rnaseq/bam/protocol_example.rnaseq.fastq.list.bam_file_list \
    --RSEM-index reference_data/RSEM_Index 

Output Files#

File	Description
`output/rnaseq/fastqc/_fastqc.html`, `_fastqc.zip`	Per-sample FastQC reports
`output/rnaseq/bam/<sample>.Aligned.sortedByCoord.out.bam`	STAR genome-coordinate alignments
`output/rnaseq/bam/<sample>.Aligned.toTranscriptome.out.bam`	STAR transcriptome alignments (RSEM input)
`output/rnaseq/bam/<sample-list-stem>.bam_file_list`	Per-run BAM manifest consumed by `rnaseqc_call` / `rsem_call`
`output/rnaseq/bam/<sample>.rsem.genes.results`, `*.rsem.isoforms.results`	RSEM per-sample quantifications
`output/rnaseq/bam/rsem_genes_tpm.txt.gz`, `rsem_transcripts_tpm.txt.gz`	Merged gene/transcript TPM matrices
`output/rnaseq/bam/*multiqc_report.html`	Aggregated QC report

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution
`rsem_call` / `rnaseqc_call`	-	`Target unavailable: .../<name>.Aligned.toTranscriptome.out_*.bam`	`--bam_list` points at a manifest whose bare filenames do not exist under the given `--cwd` (the step prepends `{cwd}/` to each entry)	Use the `STAR_align`-generated `<sample-list-stem>.bam_file_list` and set `--cwd` to the directory that actually holds those BAMs

Command interface#

!sos run RNA_calling.ipynb -h

Setup and global parameters#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# The output directory for generated files.
parameter: cwd = path("output")
# Sample meta data list
parameter: sample_list = path
# Sample names to analyze
parameter: sample_name = list() #input should be --sample_name sample1 sample2, if multiple samples
# Raw data directory, default to the same directory as sample list
parameter: data_dir = path(f"{sample_list:d}")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Memory for Java virtual mechine (`picard`)
parameter: java_mem = "6G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
# VarVCFfile for data preparation for wasp_filtering
parameter: varVCFfile = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
data_dir = path(f'{data_dir:a}')
sample_list = path(f'{sample_list:a}')
modular_script_dir = path(f'{modular_script_dir:a}')
## Whether the fasta/fastq file is compressed or not.
parameter: uncompressed = False
import os
import pandas as pd
def fastqc_output_base(target):
    target = target[0] if hasattr(target, '__getitem__') and not isinstance(target, (str, bytes, os.PathLike)) else target
    name = os.path.basename(str(target))
    for suffix in (".fastq.gz", ".fq.gz", ".fastq", ".fq", ".bam", ".sam", ".cram"):
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return os.path.splitext(name)[0]
## FIX: The way to get sample needs to be revamped to 1. Accomodate rf/fr as column 2. accomodate single end read (Only 2 fq/samples)
sample_inv = pd.read_csv(sample_list,sep = "\t")

# Align specific sample, to ensure correct grouping in subsequent steps, perform this step prior to extracting strand, read_length, and sample_id information
if len(sample_name)>0:
    print("Align sample",sample_name, 'only...')
    sample_inv = sample_inv[sample_inv['ID'].isin(sample_name)] 
    
## Extract strand information if user have specified the strand
strand_inv = []

if "strand" in sample_inv.columns:
    strand_inv = sample_inv.strand.values.tolist()
    sample_inv = sample_inv.drop("strand" , axis = 1)
    stop_if(not all([x in ["fr", "rf", "unstranded","strand_missing"] for x in strand_inv ]), msg = "strand columns should only include ``fr``, ``rf``, ``strand_missing`` or ``unstranded``")
            
## Extract read_length if user have specified read_length
read_length = [0] * len(sample_inv.index)
if "read_length" in sample_inv.columns:
    read_length = sample_inv.read_length.values.tolist()
    sample_inv = sample_inv.drop("read_length" , axis = 1)
    

## Read columns follow the documented fq1/fq2 contract; legacy one/two-read-column sample sheets are also accepted
file_columns = [x for x in ["fq1", "fq2"] if x in sample_inv.columns]
if len(file_columns) == 0:
    legacy_file_columns = [x for x in sample_inv.columns if x != "ID"]
    stop_if(len(legacy_file_columns) == 0 or len(legacy_file_columns) > 2,
        msg = "Sample list must contain fq1 and optional fq2 columns, or a legacy ID + one/two read-column layout after removing strand/read_length")
    file_columns = legacy_file_columns

## Extract sample_id
sample_inv_list = sample_inv[["ID"] + file_columns].values.tolist()
sample_id = [x[0] for x in sample_inv_list]

## Get the file name for single/paired end data
file_inv = [x[1:] for x in sample_inv_list]
file_inv = [item for sublist in file_inv for item in sublist if pd.notna(item) and str(item).strip() != ""]

raw_reads = [f'{data_dir}/{x}' for x in file_inv]


for y in raw_reads:
        if not os.path.isfile(y):
            raise ValueError(f"File {y} does not exist")

if len(raw_reads) != len(set(raw_reads)):
        raise ValueError("Duplicated files are found (but should not be allowed) in sample file list")

# Is the RNA-seq data pair-end
is_paired_end = 0 if len(raw_reads) == len(sample_id) else 1 
from sos.utils import env
env.logger.info(f'Input samples are {"paired-end" if is_paired_end else "single-end"} sequences.')

Step 0. Convert BAM back to fastq if necessary#

[bam_to_fastq]
input: raw_reads, group_by = 1
output: f'{cwd}/{_input:bn}.1.fastq',f'{cwd}/{_input:bn}.2.fastq'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh bam_to_fastq \
        --cwd "${cwd}" \
        --input-bam "${_input}" \
        --sorted-bam "${_output[0]:nn}.sorted.bam" \
        --output-fastq1 "${_output[0]}" \
        --output-fastq2 "${_output[1]}"

Step 1. QC before alignment#

This step utilize fastqc and will generate two QC report in html format. It should be noted that the paired_end reads will also be qc separately

Step Inputs#

fastq1 and fastq2: paths to original fastq.gz file.

Step Outputs#

Two html file for QC report

[fastqc]
input: raw_reads, group_by =  1
output: f'{cwd}/{fastqc_output_base(_input)}_fastqc.html',f'{cwd}/{fastqc_output_base(_input)}_fastqc.zip' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh fastqc \
        --cwd "${cwd}" \
        --sample-list "${sample_list}" \
        --data-dir "${data_dir}" \
        --input-fastq "${_input}" \
        --numThreads ${numThreads}

Step 2. Remove adaptor through `fastp`#

Documentation: fastp

We use fastp in place of the Trimmomatic for fastp’s ability to detect adapter from the reads. It was a c++ command line tool published in Sept 2018. It will use the following algorithm to detect the adaptors:

The adapter-sequence detection algorithm is based on two assumptions: the first is that only one adapter exists in the data; the second is that adapter sequences exist only in the read tails. These two assumptions are valid for major next-generation sequencers like Illumina HiSeq series, NextSeq series and NovaSeq series. We compute the k-mer (k = 10) of first N reads (N = 1 M). From this k-mer, the sequences with high occurrence frequencies (>0.0001) are considered as adapter seeds. Low-complexity sequences are removed because they are usually caused by sequencing artifacts. The adapter seeds are sorted by its occurrence frequencies. A tree-based algorithm is applied to extend the adapter seeds to find the real complete adapter

It was demostrated that fastp can remove all the adaptor automatically and completely faster than Trimmomatic and cutadapt

Step Inputs#

fastq: 1 set of fq.gz file for each sample. (2 files if paired end, 1 file if single end )

Step Outputs#

1 set of fastq.gz file for alignment. (2 files if paired end, 1 file if single end )
The unpaired reads (i.e.where a read survived, but the partner read did not.) will be discarded by default as those were not used in the following steps. This feature can be added were it was needed.
1 set of html documenting the quality of input reads(1 html file and 1 json file)

Step options#

A few options were selected to be customizable so that fastp step can have the same level of flexiblity as the Trimmomatic step had. They are:

min_len: length_required, reads shorter than this will be discarded, default is 15. (int [=15])
window_size: cut_window_size, the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
leading/trailing : cut_front_mean_quality/cut_tail_mean_quality, the mean quality requirement option for cut_front/cut_tail, which move a sliding window from front/tail, drop the bases in the window if its mean quality < threshold. Notice the choice of quality score in fastp (N=20) is a lot higher than that of trimmomatic (N=3). Default fastp setting is in line with that of cutadapt.

The default value are set to be the same as the default value of the fastp software.

[fastp_trim_adaptor_1]
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# the mean quality requirement option for cut_front
parameter: leading = 20
# the mean quality requirement option for cut_tail
parameter: trailing = 20
# reads shorter than length_required will be discarded
parameter: min_len = 15
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
warn_if(fasta_with_adapters_etc.is_file(),msg = "Use input fasta and adaptor detection of paired-end read was disabled" )

input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: [f'{cwd}/{path(x):bn}.trimmed.fq.gz' for x in _input]
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash:  container=container,expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
        fastp -i ${f'{_input[0]} -I {_input[1]}' if is_paired_end else _input} -o ${ f'{_output[0]} -O {_output[1]}' if is_paired_end else _output } \
            ${f'--adapter_fasta {fasta_with_adapters_etc}' if fasta_with_adapters_etc.is_file() else "--detect_adapter_for_pe"}  -V -h ${_output[0]:n}.html -j ${_output[0]:n}.json -w ${numThreads} \
            --length_required ${min_len}  -W ${window_size} -M ${required_quality} -5 -3 --cut_front_mean_quality ${leading} --cut_tail_mean_quality ${leading}

[fastp_trim_adaptor_2]
input: group_by = "all"
output: f'{cwd}/{sample_list:n}.trimmed.txt'
sample_inv_tmp = pd.read_csv(sample_list,sep = "\t")
import csv
if is_paired_end:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input][::2]
    sample_inv_tmp.fq2 = [f'{x:r}'.replace("'","") for x in _input][1::2]
else:
    sample_inv_tmp.fq1 = [f'{x:r}'.replace("'","") for x in _input]
sample_inv_tmp.to_csv(_output,sep = "\t",index = 0)

Step 2 Alternative: Remove adaptor through `Trimmomatic`#

Documentation: Trimmomatic

We have replaced this with fastp, see above, which performs better than Trimmomatic in terms of removing adaptors that Trimmomatic cannot detect. fastp can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to fastp.

Step Inputs#

software_dir: directory for the software
fasta_with_adapters_etc: filename for the adapter reference file. According to Trimmomatic documention,

As a rule of thumb newer libraries will use TruSeq3, but this really depends on your service provider. If you use FASTQC, the “Overrepresented Sequences” report can help indicate which adapter file is best suited for your data. “Illumina Single End” or “Illumina Paired End” sequences indicate single-end or paired-end TruSeq2 libraries, and the appropriate adapter files are TruSeq2-SE.fa and TruSeq2-PE.fa respectively. “TruSeq Universal Adapter” or “TruSeq Adapter, Index …” indicates TruSeq-3 libraries, and the appropriate adapter files are TruSeq3-SE.fa or TruSeq3-PE.fa, for single-end and paired-end data respectively. Adapter sequences for TruSeq2 multiplexed libraries, indicated by “Illumina Multiplexing …”, and the various RNA library preparations are not currently included.

We have fastqc workflow previously defined and executed. Users should decide what fasta adapter reference to use based on fastqc results (or their own knowledge).

Step Outputs#

Two paired fastq.gz file for alignment
Two unpaired fastq.gz

For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the ‘paired’ output where both reads survived the processing, and 2 for corresponding ‘unpaired’ output where a read survived, but the partner read did not.

You need to figure out from fastqc results what adapter reference sequence to use., eg --fasta_with_adapters_etc TruSeq3-PE.fa. These files can be downloaded from Trimmomatic github repo.

[trimmomatic_trim_adaptor]
# Illumina clip setting
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
parameter: seed_mismatches = 2
parameter: palindrome_clip_threshold = 30
parameter: simple_clip_threshold = 10
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# Other settings
parameter: leading = 3
parameter: trailing = 3
parameter: min_len = 50
input: raw_reads, group_by = is_paired_end + 1 , group_with = "sample_id"
output: ([ f'{cwd}/{_sample_id}_paired_{_input[0]:bn}.gz', f'{cwd}/{_sample_id}_unpaired_{_input[0]:bn}.gz',  f'{cwd}/{_sample_id}_paired_{_input[1]:bn}.gz',f'{cwd}/{_sample_id}_unpaired_{_input[1]:bn}.gz' ] if is_paired_end else f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz')
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    trimmomatic -Xmx${java_mem} ${"PE" if is_paired_end else "SE"}  -threads ${numThreads} \
        ${_input:r}  ${_output:r} \
        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \
        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}

Step 3. Alignment through `STAR`#

Documentation : STAR and Script in docker

This step is the main step for STAR alignment.

Originally, the STAR alignment is implemented using the run_STAR.py command from gtex. However, in order to accomodate different read length among samples, We reimplement what was included in the wrapper, including the re-alignment of STAR unsorted bam using samtools to avoid high mem consumption.

Step Inputs#

paths to trimmed fastq.gz file from Step 1 as documented in the new sample list.
STAR_index: directory for the STAR aligment index

Step Outputs#

bam file output ${cwd}/{sample_id}.Aligned.sortedByCoord.bam, will be used in step 3 and 4
bam file output ${cwd}/{sample_id}.Aligned.toTranscriptome.bam, will be used in step 5

NB: “${}" and "$[]” are exchangeable as expension, just to keep them consistent throught the bash block.

[STAR_align_1]
# Reference gene model
parameter: gtf = path
# STAR indexing file
parameter: STAR_index = path
gtf = path(f'{gtf:a}')
STAR_index = path(f'{STAR_index:a}')
parameter: outFilterMultimapNmax = 20 
parameter: alignSJoverhangMin = 8 
parameter: alignSJDBoverhangMin = 1 
parameter: outFilterMismatchNmax = 999 
parameter: outFilterMismatchNoverLmax = 0.1 #larger than Yang's group (outFilterMismatchNoverReadLmax 0.04)
parameter: alignIntronMin = 20 
parameter: alignIntronMax = 1000000 
parameter: alignMatesGapMax = 1000000 
parameter: outFilterType =  "BySJout" 
parameter: outFilterScoreMinOverLread = 0.33 
parameter: outFilterMatchNminOverLread = 0.33 
parameter: limitSjdbInsertNsj = 1200000 
parameter: outSAMstrandField = "intronMotif" 
parameter: outFilterIntronMotifs = "None" 
parameter: alignSoftClipAtReferenceEnds = "Yes" 
parameter: quantMode = ["TranscriptomeSAM", "GeneCounts"]
parameter: outSAMattrRGline = ["ID:rg1", "SM:sm1"]
parameter: outSAMattributes = ["NH", "HI", "AS", "nM", "NM", "ch"] #(Yang's group: outSAMattributes NH HI AS nM XS vW). vW added is varVCFfile is set
parameter: chimSegmentMin = 15 
parameter: chimJunctionOverhangMin = 15 
parameter: chimOutType = ["Junctions", "WithinBAM", "SoftClip"]
parameter: chimMainSegmentMultNmax = 1 
parameter: sjdbOverhang = 100
parameter: mapping_quality = 255
if int(mem.replace("G","")) <  40:
    print("Insufficent memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
if varVCFfile:
    if not path(varVCFfile).is_file():
        raise FileNotFoundError(f"Cannot find varVCFfile ``{varVCFfile}``")
    print("Adding wasp filter in STAR alignment")
# This option is commented out because it will force the downstream analysis to use 40G, which significantlly slow down the process.
input: raw_reads, group_by = is_paired_end + 1, group_with = {"sample_id","read_length"}
output: cord_bam = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out.bam',
        trans_bam = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out.bam'
if _read_length == 0:
    print("Using specified --sjdbOverhang as read length")
else:
    print("Using read length specified in the sample list")
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = star_mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh star_align_1 \
        --cwd "${cwd}" \
        --gtf "${gtf}" \
        --STAR-index "${STAR_index}" \
        --input-reads "${_input:r}" \
        --sample-id "${_sample_id}" \
        --paired-end ${is_paired_end} \
        --read-files-command "${"cat" if uncompressed else "zcat"}" \
        --output-prefix "${_output[0]:nnnn}" \
        --output-cord-bam "${_output["cord_bam"]}" \
        --output-trans-bam "${_output["trans_bam"]}" \
        --outFilterMultimapNmax ${outFilterMultimapNmax} \
        --alignSJoverhangMin ${alignSJoverhangMin} \
        --alignSJDBoverhangMin ${alignSJDBoverhangMin} \
        --outFilterMismatchNmax ${outFilterMismatchNmax} \
        --outFilterMismatchNoverLmax ${outFilterMismatchNoverLmax} \
        --alignIntronMin ${alignIntronMin} \
        --alignIntronMax ${alignIntronMax} \
        --alignMatesGapMax ${alignMatesGapMax} \
        --outFilterType "${outFilterType}" \
        --outFilterScoreMinOverLread ${outFilterScoreMinOverLread} \
        --outFilterMatchNminOverLread ${outFilterMatchNminOverLread} \
        --limitSjdbInsertNsj ${limitSjdbInsertNsj} \
        --outSAMstrandField "${outSAMstrandField}" \
        --outFilterIntronMotifs "${outFilterIntronMotifs}" \
        --alignSoftClipAtReferenceEnds "${alignSoftClipAtReferenceEnds}" \
        --quantMode "${" ".join(quantMode)}" \
        --outSAMattrRGline "${" ".join(outSAMattrRGline)}" \
        --outSAMattributes "${" ".join(outSAMattributes)}" \
        --chimSegmentMin ${chimSegmentMin} \
        --chimJunctionOverhangMin ${chimJunctionOverhangMin} \
        --chimOutType "${" ".join(chimOutType)}" \
        --chimMainSegmentMultNmax ${chimMainSegmentMultNmax} \
        --sjdbOverhang ${sjdbOverhang if _read_length == 0 else _read_length} \
        --var-vcf-file "${varVCFfile}"

[strand_detected_1: shared="step_strand_detected"]
input: output_from("STAR_align_1")["trans_bam"], group_with = "sample_id"
import pandas as pd
ReadsPerGene = pd.read_csv(f'{cwd}/{_sample_id}.ReadsPerGene.out.tab' , sep = "\t" , header = None )
ReadsPerGene_list = ReadsPerGene.loc[3::,].sum(axis = 0, numeric_only = True).values.tolist()
strand_percentage = [x/ReadsPerGene_list[0] for x in ReadsPerGene_list]
print(f'for sample {_sample_id}')
if strand_percentage[1] > 0.9:
    print(f'Counts for the 1st read strand aligned with RNA is {strand_percentage[1]}, > 90% of aligned count')
    print('Data is likely FR/fr-secondstrand')
    strand_detected = "fr"
elif  strand_percentage[2] > 0.9:
    print(f'Counts for the 2nd read strand aligned with RNA is {strand_percentage[2]}, > 90% of aligned count')
    print('Data is likely RF/fr-firststrand')
    strand_detected = "rf"
elif max( strand_percentage[1],  strand_percentage[2]) < 0.6:
    print(f'Both {strand_percentage[1]} and  {strand_percentage[2]} are under 60% of reads explained by one direction')
    print('Data is likely unstranded')
    strand_detected = "unstranded"
else:
    strand_detected = 'strand_missing' 
    print(f'Data does not fall into a likely stranded (max percent explained {max( strand_percentage[1],  strand_percentage[2])} > 0.9) or unstranded layout (max percent explained {max( strand_percentage[1],  strand_percentage[2])} < 0.6), please check your data and manually specified the ``--strand`` option as ``fr``, ``rf`` or ``unstranded``')

[strand_detected_2: shared="strand"]
input: group_by = "all"
parameter: strand = ""
import pandas as pd
if not strand:
    if len(strand_inv) > 0:
        strand = strand_inv
        for i in range(0,len(strand)):
            if strand[i] == "strand_missing":
                strand[i] = step_strand_detected[i]
        print(f'Using strand specified in the input samples list {strand}, replacing strand_missing with detected strand')
    else:
        warn_if(not all(x is step_strand_detected[0] for x in step_strand_detected), msg = "strands detected are different among samples, please check your protocol, we will use the detected strand for each samples")    
        strand = step_strand_detected
        print(f'Using detected strand for each samples {strand}')
else:
    stop_if(strand not in ["fr", "rf", "unstranded"], msg = "``--strand`` option should be ``fr``, ``rf`` or ``unstranded``")
    print(f'Using ``--strand`` overwrite option for all the samples {strand[0]}') 
    strand = [strand] * len(step_strand_detected)

Step 4. Mark duplicates reads & QC through `Picard`#

This step is the first QC step after STAR alignment. This step will performed QC to collect multipe metrics regarding the RNASeq using Picard. Then it will also generate a new .bam file with duplication marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024

Step Inputs:#

STAR_bam: path to the output in Step 2.
reference_fasta: The fasta reference file used to generate star index, it is critical that the this fasta is exactly the same as those that generate the STAR Index, else this error will occurs: statfungen/xqtl-protocol#357
RefFlat file

This file is needed for picard CollectRnaSeqMetrics module, which in turn

produces metrics describing the distribution of the bases within the transcripts. It calculates the total numbers and the fractions of nucleotides within specific genomic regions including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). This tool also determines the numbers of bases that pass quality filters that are specific to Illumina data (PF_BASES).

The refFlat file can be generated by the reference_data module, RefFlat_generation step.

Step Outputs:#

A collection of metrics file for each of the samples
A new .bam file with duplication marked with the hexadecimal value of 0x0400, which corresponds to a decimal value of 1024

[picard_qc, STAR_align_2]
# Reference gene model
parameter: gtf = path
depends: sos_variable('strand')
# Path to flat reference file, for computing QC metric
parameter: ref_flat = path
# The fasta reference file used to generate star index
parameter: reference_fasta = path
gtf = path(f'{gtf:a}')
ref_flat = path(f'{ref_flat:a}')
reference_fasta = path(f'{reference_fasta:a}')
# For the patterned flowcell models (HiSeq X), change to 2500
parameter: optical_distance = 100
parameter: zap_raw_bam = False
picard_strand_dict = {"rf":"SECOND_READ_TRANSCRIPTION_STRAND","fr": "FIRST_READ_TRANSCRIPTION_STRAND","unstranded":"NONE" }
input: output_from('filter_reads'), group_by = 2, group_with = {"sample_id","strand"}
output: picard_metrics = f'{cwd}/{_sample_id}.alignment_summary_metrics',
        picard_rna_metrics = f'{cwd}/{_sample_id}.rna_metrics',
        md_bam = f'{_input[0]:n}.md.bam',
        md_metrics = f'{_input[0]:n}.md.metrics',
        bigwig = f'{_input[0]:n}.md.bw',
        output_summary = f'{cwd}/{_sample_id}.bam_file_meta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh picard_qc_star_align_2 \
        --cwd "${cwd}" \
        --gtf "${gtf}" \
        --ref-flat "${ref_flat}" \
        --reference-fasta "${reference_fasta}" \
        --java-mem "${java_mem}" \
        --optical-distance ${optical_distance} \
        --input-bam "${_input[0]}" \
        --sample-id "${_sample_id}" \
        --strand "${_strand}" \
        --picard-metrics "${_output["picard_metrics"]}" \
        --picard-rna-metrics "${_output["picard_rna_metrics"]}" \
        --md-bam "${_output["md_bam"]}" \
        --md-metrics "${_output["md_metrics"]}" \
        --bigwig "${_output["bigwig"]}" \
        --output-summary "${_output["output_summary"]}" \
        --var-vcf-file "${varVCFfile}" \
        --zap-raw-bam "${zap_raw_bam}"

[STAR_align_3]
input: group_by = "all"
depends: sos_variable("strand")
output: f'{cwd}/{sample_list:bn}.bam_file_list'
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = 1
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.py \
        --step star_align_3 \
        --output "${_output}" \
        --input ${" ".join([str(x) for x in _input])} \
        --sample-id ${" ".join([str(x) for x in sample_id])} \
        --strand ${" ".join([str(x) for x in strand])} \
        --var-vcf-file "${varVCFfile}"

Step 5. Post aligment QC through `RNA-SeQC`#

Documentation : RNA-SeQC and Script in docker

This step is second QC step after STAR alignment. It will perform RNA-seq quantification as well.

Step Inputs#

bam_list: path to the output of STAR_output, which outlined the strands and bam file used for each samples.
gtf: reference genome .gtf file, this gtf file need to have the same chr name format as the index used to generate the bam file and must be on collapsed gene gtf

Step Outputs#

The following output files are generated in the output directory you provide:

{sample}.metrics.tsv : A tab-delimited list of (Statistic, Value) pairs of all statistics and metrics recorded.
{sample}.exon_reads.gct : A tab-delimited GCT file with (Exon ID, Gene Name, coverage) tuples for all exons which had at least part of one read mapped.
{sample}.gene_reads.gct : A tab-delimited GCT file with (Gene ID, Gene Name, coverage) tuples for all genes which had at least one read map to at least one of its exons
{sample}.gene_tpm.gct : A tab-delimited GCT file with (Gene ID, Gene Name, TPM) tuples for all genes reported in the gene_reads.gct file. Note: this file is renamed to .gene_rpkm.gct if the –rpkm flag is present.
{sample}.fragmentSizes.txt : A list of fragment sizes recorded, if a BED file was provided
{sample}.coverage.tsv : A tab-delimited list of (Gene ID, Transcript ID, Mean Coverage, Coverage Std, Coverage CV) tuples for all transcripts encountered in the GTF.

[rnaseqc_call_1]
import os
import pandas as pd
parameter: bam_list = path
# Reference gene model
parameter: gtf = path
parameter: reference_fasta = path
bam_list = path(f'{bam_list:a}')
gtf = path(f'{gtf:a}')
reference_fasta = path(f'{reference_fasta:a}')
sample_inv = pd.read_csv(bam_list,sep = "\t")
parameter: detection_threshold = 5
parameter: mapping_quality = 255
## Extract strand information if user have specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list ]), msg = "strand columns should only include ``fr``, ``rf`` or ``unstranded``, please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file name for cood_bam data
coord_bam_list_inv = sample_inv.coord_bam_list.values.tolist()
coord_bam_list = [x if os.path.isabs(x) else f'{cwd}/{x}' for x in coord_bam_list_inv ]
input:  coord_bam_list, group_by = 1, group_with = {"sample_id","strand_list"}
output: f'{cwd}/{_sample_id}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.gene_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.exon_reads.gct.gz',
        f'{cwd}/{_sample_id}.rnaseqc.metrics.tsv'

task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh rnaseqc_call \
        --cwd "${cwd}" \
        --sample-list "${sample_list}" \
        --data-dir "${data_dir}" \
        --gtf "${gtf}" \
        --reference-fasta "${reference_fasta}" \
        --input-bam "${_input}" \
        --sample-id "${_sample_id}" \
        --strand "${_strand_list}" \
        --detection-threshold "${detection_threshold}" \
        --mapping-quality "${mapping_quality}" \
        --paired-end ${is_paired_end} \
        --numThreads ${numThreads}

The RNASEQC results were merged in the following step,

[rnaseqc_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rnaseqc.gene_tpm.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.gene_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.exon_readsCount.gct.gz',
        f'{cwd}/{bam_list:bn}.rnaseqc.metrics.tsv'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.py \
        --step rnaseqc_merge \
        --cwd "${cwd}" \
        --name "${bam_list:bn}" \
        --input ${" ".join([str(x) for x in _input])}

Step 6. Quantify expression through `RSEM`#

Documentation : RSEM and Script in docker

This step generate the expression matrix from STAR output. Estimate gene and isoform expression from RNA-Seq data are generated.

Step Input#

bam_list: path to the output of STAR_output, which outlined the strands and bam file used for each samples.
transcript-level BAM file: path to the output of Step 3.
RSEM_index: path to RSEM index

Step Outputs#

Please see the output section of https://deweylab.github.io/RSEM/rsem-calculate-expression.html

[rsem_call_1]
parameter: bam_list = path
parameter: RSEM_index = path
bam_list = path(f'{bam_list:a}')
RSEM_index = path(f'{RSEM_index:a}')
parameter: max_frag_len = 1000
parameter: estimate_rspd = True

import os
sample_inv = pd.read_csv(bam_list,sep = "\t")

## Extract strand information if user have specified the strand
strand_list = sample_inv.strand.values.tolist()
stop_if(not all([x in ["fr", "rf", "unstranded"] for x in strand_list ]), msg = "strand columns should only include ``fr``, ``rf`` or ``unstranded``, please check the bam_list")     
## Extract sample_id
sample_id = sample_inv.sample_id.values.tolist()
    
## Get the file name for trans_bam_list data
trans_bam_list_inv = sample_inv.trans_bam_list.values.tolist()
trans_bam_list = [x if os.path.isabs(x) else f'{cwd}/{x}' for x in trans_bam_list_inv]
input: trans_bam_list, group_by = 1, group_with = {"sample_id", "strand_list"} 
output: f'{cwd}/{_sample_id}.rsem.isoforms.results', f'{cwd}/{_sample_id}.rsem.genes.results',f'{cwd}/{_sample_id}.rsem.stat/{_sample_id}.rsem.cnt'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', entrypoint=entrypoint
    run_RSEM ${RSEM_index:a} ${_input:a} ${_sample_id} \
        -o ${_output[0]:d} \
        --max_frag_len ${max_frag_len} \
        --estimate_rspd ${'true' if estimate_rspd else 'false'} \
        --paired_end ${"true" if is_paired_end else "false"} \
        --is_stranded ${"true" if _strand_list != "unstranded" else "false"} \
        --threads ${numThreads}

The RSEM results were merged in the following steps, seven files (four for each columns in the isoform output and 3 for each of the genes output) will be generated.

[rsem_call_2]
parameter: bam_list = path
input: group_by = "all"
output: f'{cwd}/{bam_list:bn}.rsem_transcripts_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_transcripts_isopct.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_expected_count.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_tpm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem_genes_fpkm.txt.gz',
        f'{cwd}/{bam_list:bn}.rsem.aggregated_quality.metrics.tsv'

task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.py \
        --step rsem_call_2 \
        --cwd "${cwd}" \
        --name "${bam_list:bn}" \
        --input ${" ".join([str(x) for x in _input])}

Step 7. summarize with MultiQC#

MultiQC

MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn’t run other tools for you - it’s designed to be placed at the end of analysis pipelines or to be run manually when you’ve finished running your tools. When you launch MultiQC, it recursively searches through any provided file paths and finds files that it recognises. It parses relevant information from these and generates a single stand-alone HTML report file. It also saves a directory of data files with all parsed data for further downstream use.

MultiQC will automatically generate QC report for anything embedded within the given directory. Therefore providing the directory containing all the output will surfice.

The output of MultiQC is a multi-module report each corresponding to the quality report of each step of analysis previously performed.

[rsem_call_3,rnaseqc_call_3]
output: f'{_input[0]:nnn}.multiqc_report.html'
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh multiqc_report \
        --cwd "${cwd}" \
        --input-dir "${_input:d}" \
        --output-report "${_output}" \
        --multiqc-config "${_output:n}.multiqc_config.yml"

[rsem_call_4, rnaseqc_call_4]
# Path to flat reference file, for computing QC metric
output: f'{_input[0]:nn}.picard.aggregated_quality.metrics.tsv'# it will be outputed in a very 'deep' directory with {cwd}/
task: trunk_workers = 1, walltime = walltime, mem = mem, cores = numThreads, trunk_size = job_size
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.R \
        --step aggregate_picard_qc \
        --input-dir "${_input[-1]:d}" \
        --output "${_output}" \
        --is-paired-end ${is_paired_end} \
        ${"--wasp" if varVCFfile else ""} \
        --numThreads ${numThreads}

Step 8. Filter Reads with WASP#

This optional step in the STAR alignment process offers filtration based on WASP correction and unique read identification. WASP correction mitigates reference allele bias in RNA-seq data by eliminating reads with specific WASP flags. Unique read filtering retains only uniquely mapped reads, identified by a mapping quality (MAPQ) value of 255, as per GTEx Consortium standards. These filters can be applied individually or in combination to enhance alignment quality and prepare BAM files for different analysis.

Step Inputs:#

No input needed to specify manually. All files are from the output from STAR_align_1

Step Outputs:#

A collection of filtered(or not) .bam files

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[filter_reads]
parameter: unique=""
parameter: wasp=""
parameter: mapping_quality = 255
depends: sos_variable('strand')
input: output_from("STAR_align_1"), group_by = 2, group_with = {"sample_id","strand"}
output: cord_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.sortedByCoord.out{"_wasp" if wasp else "_nowasp"}{"_qc" if wasp else "_noqc"}.bam',
        trans_bam_wasp_qc = f'{cwd}/{_sample_id}.Aligned.toTranscriptome.out{"_wasp" if wasp else "_nowasp"}{"_qc" if wasp else "_noqc"}.bam'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    bash ${modular_script_dir}/molecular_phenotypes/calling/RNA_calling.sh filter_reads \
        --cwd "${cwd}" \
        --input-cord-bam "${_input[0]}" \
        --input-trans-bam "${_input[1]}" \
        --output-cord-bam "${_output["cord_bam_wasp_qc"]}" \
        --output-trans-bam "${_output["trans_bam_wasp_qc"]}" \
        --unique "${unique}" \
        --wasp "${wasp}" \
        --mapping-quality "${mapping_quality}"

Quantifying expression from RNA-seq data

Contents

Quantifying expression from RNA-seq data#

Description#

Input#

Output#

Input Files#

Minimal Working Example#

Quality control with fastqc#

Trim adaptors with fastp (optional)#

Align reads with STAR#

Gene-level expression with rnaseqc#

Transcript-level expression with RSEM#

Output Files#

Troubleshooting#

Command interface#

Setup and global parameters#

Step 0. Convert BAM back to fastq if necessary#

Step 1. QC before alignment#

Step Inputs#

Step Outputs#

Step 2. Remove adaptor through fastp#

Step Inputs#

Step Outputs#

Step options#

Step 2 Alternative: Remove adaptor through Trimmomatic#

Step Inputs#

Step Outputs#

Step 3. Alignment through STAR#

Step Inputs#

Step Outputs#

Step 4. Mark duplicates reads & QC through Picard#

Step Inputs:#

Step Outputs:#

Step 5. Post aligment QC through RNA-SeQC#

Step Inputs#

Step Outputs#

Step 6. Quantify expression through RSEM#

Step Input#

Step Outputs#

Step 7. summarize with MultiQC#

Step 8. Filter Reads with WASP#

Step Inputs:#

Step Outputs:#

Anticipated Results#

Quality control with `fastqc`#

Trim adaptors with `fastp` (optional)#

Align reads with `STAR`#

Gene-level expression with `rnaseqc`#

Transcript-level expression with `RSEM`#

Step 2. Remove adaptor through `fastp`#

Step 2 Alternative: Remove adaptor through `Trimmomatic`#

Step 3. Alignment through `STAR`#

Step 4. Mark duplicates reads & QC through `Picard`#

Step 5. Post aligment QC through `RNA-SeQC`#

Step 6. Quantify expression through `RSEM`#