RNA-seq expression#
Input Files#
File |
Description |
|---|---|
|
Raw RNA-seq FASTQ files (toy dataset) |
|
Sample sheet listing one FASTQ (pair) per sample |
|
STAR genome index used for alignment |
Miniprotocol Timing#
Timing <3.5 hours
Description#
This miniprotocol shows the use of various modules to prepare reference data, perform RNA-seq calling, quantify expression, conduct quality control and normalize data. The modules are as follows:
RNA_calling.ipynb(steps 1-5): Quantifying expression from RNA-seq databulk_expression_QC.ipynb(step 6): Sample level RNA-seq quality controlbulk_expression_normalization.ipynb(step 7): Bulk RNA-seq counts normalization
Toy-data run notes#
These examples run against the bundled toy dataset. Two adjustments were made versus the original protocol notebook so the examples run out-of-the-box on the toy data: the optional --varVCFfile (WASP allele-specific filtering) is omitted, since it is not required for eQTL expression quantification and the toy reference does not ship a sample variants VCF; and the QC/normalization input is the toy input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed (the original referenced a tmp filename).
Verified end-to-end on the toy data: steps 1 (fastqc), 2 (fastp_trim_adaptor), 6 (qc), and 7 (normalize) complete successfully, with step vii producing the final TensorQTL-ready ...tmm_cpm_voom.expression.bed.gz. Steps 3 (STAR_align), 4 (rnaseqc_call), and 5 (rsem_call) require a complete STAR genome index; the bundled reference_data/STAR_Index is incomplete (it contains only the suffix-array shards and is missing genomeParameters.txt/Genome), so STAR exits with a FATAL ERROR. To run those steps, first build a STAR index from the reference fasta+gtf (via the RNA_calling.ipynb index workflow) and point --STAR-index to it. The RNA-calling FASTQ inputs are toy-sized: each sample (SAMPLE_001, SAMPLE_002) was downsampled to the first 100,000 read-pairs (~7 MB per gzipped file) so the bundle stays small; they are listed in input/rnaseq/protocol_example.rnaseq.fastq.list.txt and resolved from input/rnaseq/protocol_example.rnaseq.fastq/. Note that the RNA-calling steps (FASTQ QC, trimming, alignment) are a small 2-sample demo (SAMPLE_001, SAMPLE_002); the downstream expression matrices used for QC and normalization cover the full 60-sample toy cohort (SAMPLE_001-SAMPLE_060). The two are independent toy inputs, so there is no need for 60 FASTQ files.
Steps#
1. Perform data quality summary via fastqc#
sos run pipeline/RNA_calling.ipynb fastqc \
--cwd output/rnaseq/fastqc \
--sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
--data-dir input/rnaseq/protocol_example.rnaseq.fastq
2. Cut adaptor (Optional)#
sos run pipeline/RNA_calling.ipynb fastp_trim_adaptor \
--cwd output/rnaseq --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
--data-dir input/rnaseq/protocol_example.rnaseq.fastq --STAR-index reference_data/STAR_Index/ \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat
3. Read alignment via STAR and QC via Picard#
sos run pipeline/RNA_calling.ipynb STAR_align \
--cwd output/rnaseq/bam --sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
--data-dir input/rnaseq/protocol_example.rnaseq.fastq --STAR-index reference_data/STAR_Index/ \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
--chimSegmentMin 0 \
-J 50 --mem 200G --numThreads 8
4. Call gene-level RNA expression via rnaseqc#
sos run pipeline/RNA_calling.ipynb rnaseqc_call \
--cwd output/rnaseq/bam \
--sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
--data-dir input/rnaseq/protocol_example.rnaseq.fastq \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.gtf \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--bam_list input/rnaseq/protocol_example.rnaseq.bam/sample_bam_list.txt
5. Call transcript level RNA expression via RSEM#
sos run pipeline/RNA_calling.ipynb rsem_call \
--cwd output/rnaseq/bam \
--sample-list input/rnaseq/protocol_example.rnaseq.fastq.list.txt \
--data-dir input/rnaseq/protocol_example.rnaseq.fastq \
--STAR-index reference_data/STAR_Index/ \
--gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf \
--reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.ref.flat \
--bam_list input/rnaseq/protocol_example.rnaseq.bam/sample_bam_list.txt \
--RSEM-index reference_data/RSEM_Index
6. Multi-sample RNA-seq QC#
sos run pipeline/bulk_expression_QC.ipynb qc \
--cwd output/rnaseq \
--tpm-gct input/rnaseq/protocol_example.rnaseq.tpm_matrix.bed \
--counts-gct input/rnaseq/protocol_example.rnaseq.count_matrix.bed
7. Multi-sample read count normalization#
sos run pipeline/bulk_expression_normalization.ipynb normalize \
--cwd output/rnaseq \
--tpm-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.tpm.gct.gz \
--counts-gct output/rnaseq/protocol_example.low_expression_filtered.outlier_removed.geneCount.gct.gz \
--annotation-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--count-threshold 1 --sample_participant_lookup input/rnaseq/protocol_example.rnaseq.sample_participant_lookup.txt
Anticipated Results#
Output Files#
File |
Description |
|---|---|
|
Per-sample FastQC quality reports |
|
STAR-aligned, Picard-QC BAM files |
|
Normalized gene/transcript expression matrix |
The final output contained QCed and normalized expression data in a bed.gz file. This file is ready for use in TensorQTL.
%cd /restricted/projectnb/xqtl/jaempawi/xqtl_protocol/toy_example
Figure 1A. Bulk RNA-Seq Quality Control D-Statistic Distribution.
Figure 1B. Bulk RNA-Seq Quality Control Relative Log Expression Residuals.
Figure 1C. Bulk RNA-Seq Quality Control Mahalanobis Distance P-Value Clustering.