RNA-seq expression#

Input Files#

File

Description

input/rnaseq/protocol_example.rnaseq.fastq

Raw RNA-seq FASTQ files (toy dataset)

input/rnaseq/protocol_example.rnaseq.fastq.list.txt

Sample sheet listing one FASTQ (pair) per sample

reference_data/STAR_Index/

STAR genome index used for alignment

Miniprotocol Timing#

Timing <3.5 hours

Description#

This miniprotocol shows the use of various modules to prepare reference data, perform RNA-seq calling, quantify expression, conduct quality control and normalize data. The modules are as follows:

  1. RNA_calling.ipynb (steps 1-5): Quantifying expression from RNA-seq data

  2. bulk_expression_QC.ipynb (step 6): Sample level RNA-seq quality control

  3. bulk_expression_normalization.ipynb (step 7): Bulk RNA-seq counts normalization

Toy-data run notes#

These examples run against the bundled toy dataset. Two adjustments were made versus the original protocol notebook so the examples run out-of-the-box on the toy data: the optional --varVCFfile (WASP allele-specific filtering) is omitted, since it is not required for eQTL expression quantification and the toy reference does not ship a sample variants VCF; and the QC/normalization input is the toy input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed (the original referenced a tmp filename).

Verified end-to-end on the toy data: steps 1 (fastqc), 2 (fastp_trim_adaptor), 6 (qc), and 7 (normalize) complete successfully, with step vii producing the final TensorQTL-ready ...tmm_cpm_voom.expression.bed.gz. Steps 3 (STAR_align), 4 (rnaseqc_call), and 5 (rsem_call) require a complete STAR genome index; the bundled reference_data/STAR_Index is incomplete (it contains only the suffix-array shards and is missing genomeParameters.txt/Genome), so STAR exits with a FATAL ERROR. To run those steps, first build a STAR index from the reference fasta+gtf (via the RNA_calling.ipynb index workflow) and point --STAR-index to it. The RNA-calling FASTQ inputs are toy-sized: each sample (SAMPLE_001, SAMPLE_002) was downsampled to the first 100,000 read-pairs (~7 MB per gzipped file) so the bundle stays small; they are listed in input/rnaseq/protocol_example.rnaseq.fastq.list.txt and resolved from input/rnaseq/protocol_example.rnaseq.fastq/. Note that the RNA-calling steps (FASTQ QC, trimming, alignment) are a small 2-sample demo (SAMPLE_001, SAMPLE_002); the downstream expression matrices used for QC and normalization cover the full 60-sample toy cohort (SAMPLE_001-SAMPLE_060). The two are independent toy inputs, so there is no need for 60 FASTQ files.

Steps#

1. Perform data quality summary via fastqc#

sos run pipeline/RNA_calling.ipynb fastqc \
    --cwd output/rnaseq/fastqc \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/protocol_example.rnaseq.fastq 

2. Cut adaptor (Optional)#

sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
    --cwd output/rnaseq --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/protocol_example.rnaseq.fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat 

3. Read alignment via STAR and QC via Picard#

sos run pipeline/RNA_calling.ipynb STAR_align \
    --cwd output/rnaseq/bam --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/protocol_example.rnaseq.fastq --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --chimSegmentMin 0 \
    -J 50 --mem 200G --numThreads 8 

4. Call gene-level RNA expression via rnaseqc#

sos run pipeline/RNA_calling.ipynb rnaseqc_call \
    --cwd output/rnaseq/bam \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/protocol_example.rnaseq.fastq \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --bam_list input/rnaseq/protocol_example.rnaseq.bam/sample_bam_list.txt 

5. Call transcript level RNA expression via RSEM#

sos run pipeline/RNA_calling.ipynb rsem_call \
    --cwd output/rnaseq/bam \
    --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
    --data-dir input/rnaseq/protocol_example.rnaseq.fastq \
    --STAR-index reference_data/STAR_Index/ \
    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
    --bam_list input/rnaseq/protocol_example.rnaseq.bam/sample_bam_list.txt \
    --RSEM-index reference_data/RSEM_Index 

6. Multi-sample RNA-seq QC#

sos run pipeline/bulk_expression_QC.ipynb qc \
    --cwd output/rnaseq \
    --tpm-gct input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed \
    --counts-gct input/rnaseq/protocol_example.rnaseq.count_matrix.bed 

7. Multi-sample read count normalization#

sos run pipeline/bulk_expression_normalization.ipynb normalize \
    --cwd output/rnaseq \
    --tpm-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz \
    --counts-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz \
    --annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf  \
    --count-threshold 1 --sample_participant_lookup input/rnaseq/protocol_example.rnaseq.sample_participant_lookup.txt 

Anticipated Results#

Output Files#

File

Description

output/rnaseq/fastqc/

Per-sample FastQC quality reports

output/rnaseq/bam/

STAR-aligned, Picard-QC BAM files

output/rnaseq/*.bed.gz

Normalized gene/transcript expression matrix

The final output contained QCed and normalized expression data in a bed.gz file. This file is ready for use in TensorQTL.

%cd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example

Figure 1A. Bulk RNA-Seq Quality Control D-Statistic Distribution.

Figure 1B. Bulk RNA-Seq Quality Control Relative Log Expression Residuals.

Figure 1C. Bulk RNA-Seq Quality Control Mahalanobis Distance P-Value Clustering.