Reference Data#
This miniprotocol walks through how the reference data used throughout the xQTL protocol is downloaded, formatted and indexed. It is an introductory, top-to-bottom guide: each step calls a single workflow from the reference_data_preparation.ipynb module (plus VCF_QC.ipynb for dbSNP, and two interactive notebooks for the TAD and LD-block references).
Miniprotocol timing: ~4 hours, dominated by the genome download and the STAR/RSEM indexing steps.
Description#
The miniprotocol chains three modules:
- reference_data_preparation.ipynb (steps i-viii): download and format the reference genome, gene annotations, ERCC spike-in reference, dbSNP variants, and build the STAR / RSEM / RefFlat / SUPPA indices.
- generalized_TADB.ipynb (step ix): generate topologically associated domain (TAD) files and their boundaries.
- notebook_for_LD_block_reference_panel.ipynb (step x): produce LD blocks and the reference panel.
Input Files#
The download steps need only internet access; nothing local is required to begin. The later formatting and indexing steps consume the products of the preceding steps under the reference_data/ working directory.
| File | Description |
|---|---|
| | Reference genome FASTA (downloaded) |
| | Ensembl gene annotation GTF (downloaded) |
| | ERCC spike-in reference (downloaded) |
| | dbSNP variant reference (downloaded) |
Output Files#
| File | Description |
|---|---|
| | Formatted genome reference |
| | Formatted gene annotation |
| | STAR aligner index |
| | RSEM quantification index |
| | Picard-compatible refFlat annotation |
| | SUPPA annotation for psichomics |
| | Topologically associated domain windows |
| | LD blocks and reference panel |
Steps#
1. Download Reference Data#
Download the human reference genome, the Ensembl gene annotation, the ERCC spike-in reference, and the dbSNP variant file into the reference_data directory, which the later steps read from.
Timing: varies by dataset on typical compute infrastructure.
sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd reference_data
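Large downloads such as the dbSNP VCF can be truncated by a dropped connection. As a minimal sketch, and not part of the pipeline itself, the hypothetical helper `check_gzip` below uses `gzip -t` to test each compressed file it is given and prints one status line per file.

```shell
# check_gzip FILE...: test each gzip-compressed file and print a status line.
# Returns non-zero if any file is missing, empty, or fails the integrity test.
# (Hypothetical helper for illustration; not provided by the pipeline.)
check_gzip() {
    failed=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "MISSING OR EMPTY $f"
            failed=1
        elif gzip -t "$f" 2>/dev/null; then
            echo "OK $f"
        else
            echo "CORRUPT $f"
            failed=1
        fi
    done
    return $failed
}
```

A possible call, assuming the dbSNP file carries the name used in step 8 and sits in the directory passed to `--cwd`, would be `check_gzip reference_data/00-All.vcf.gz`.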
2. Format Reference Data#
Merge the human genome with the ERCC spike-in sequences and standardise the FASTA so downstream tools share one consistent reference.
sos run pipeline/reference_data_preparation.ipynb hg_reference \
--cwd reference_data \
--ercc-reference reference_data/ERCC92.fa \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa
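To make the idea of merging concrete, here is a toy sketch that assumes nothing about how the `hg_reference` workflow is implemented internally. It concatenates two FASTA files and checks that the merged file holds the sequence records of both. The helpers `count_records` and `merge_fasta` are hypothetical names introduced for illustration.

```shell
# count_records FASTA: number of sequence records (header lines starting with '>').
count_records() {
    grep -c '^>' "$1"
}

# merge_fasta GENOME SPIKEIN OUT: concatenate two FASTA files into OUT and
# fail if the record count of OUT is not the sum of the two inputs.
merge_fasta() {
    cat "$1" "$2" > "$3" || return 1
    expected=$(( $(count_records "$1") + $(count_records "$2") ))
    [ "$(count_records "$3")" -eq "$expected" ]
}
```

The record-count check catches one easy mistake: if the first file lacks a trailing newline, the first header of the second file is glued onto the last sequence line and one record silently disappears.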
3. Format Gene Feature Data#
Reformat the gene annotation (and append the ERCC features) to produce the collapsed and full GTFs used for expression quantification.
sos run pipeline/reference_data_preparation.ipynb gene_annotation \
--cwd reference_data \
--ercc-gtf reference_data/ERCC92.gtf \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
--stranded
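A frequent source of trouble when an annotation and a reference are combined is a mismatch in contig naming, for example `chr1` in one file and `1` in the other. The sketch below is a hypothetical pre-flight check, not something the workflow is known to run: it prints every contig name used in a GTF that has no matching header in a FASTA file.

```shell
# gtf_contigs_missing_from_fasta GTF FASTA: print contig names found in column 1
# of the GTF (ignoring '#' comment lines) that have no matching FASTA header.
# (Hypothetical helper for illustration.)
gtf_contigs_missing_from_fasta() {
    gtf_names=$(mktemp)
    fasta_names=$(mktemp)
    grep -v '^#' "$1" | cut -f1 | LC_ALL=C sort -u > "$gtf_names"
    # A FASTA header is '>' followed by the name, up to the first whitespace.
    sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$2" | LC_ALL=C sort -u > "$fasta_names"
    # comm -23 keeps the lines that appear only in the first (GTF) list.
    LC_ALL=C comm -23 "$gtf_names" "$fasta_names"
    rm -f "$gtf_names" "$fasta_names"
}
```

Empty output means every contig named in the annotation is present in the reference.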
4. Generate STAR Index#
Build the STAR genome index used by the RNA-seq alignment step. This is compute- and memory-intensive.
sos run pipeline/reference_data_preparation.ipynb STAR_index \
--cwd reference_data \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta
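Because index generation is memory-intensive, it can help to check how much memory is available before launching the job. The sketch below assumes a Linux host that exposes `/proc/meminfo`; the helper names are hypothetical, and the required amount is left as a parameter because this protocol does not state one.

```shell
# available_mem_gb [MEMINFO]: available memory in whole GiB, read from the
# MemAvailable line (reported in kB) of a /proc/meminfo-style file.
available_mem_gb() {
    awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# has_enough_memory REQUIRED_GB [MEMINFO]: succeed only if at least
# REQUIRED_GB GiB are currently available.
has_enough_memory() {
    avail=$(available_mem_gb "${2:-/proc/meminfo}")
    [ -n "$avail" ] && [ "$avail" -ge "$1" ]
}
```

One way to use it is to guard the command above, for example `has_enough_memory 32 && sos run ...`, where 32 is an assumed figure often quoted for human-genome STAR indices rather than a requirement taken from this protocol.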
5. Generate RSEM Index#
Build the RSEM index used for transcript-level expression quantification.
sos run pipeline/reference_data_preparation.ipynb RSEM_index \
--cwd reference_data \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
6. Generate RefFlat Annotation for Picard#
Generate the RefFlat-format annotation used by Picard for RNA-seq QC metrics.
sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
--cwd reference_data \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
7. Generate SUPPA Annotation for Psichomics#
Generate the SUPPA splicing-event annotation consumed by the psichomics splicing workflow.
sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
--cwd reference_data \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
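Steps 5 to 7 all read the same reformatted GTF, so they can be driven from one small loop. The sketch below rebuilds the three command lines shown above and, when `DRY_RUN=1`, prints them instead of executing them. The function names and the `REF_DIR` and `DRY_RUN` variables are hypothetical conveniences, not part of the pipeline.

```shell
# Paths copied from the commands above; adjust if your files live elsewhere.
REF_DIR=${REF_DIR:-reference_data}
FASTA=$REF_DIR/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta
GTF=$REF_DIR/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

# run_index_step WORKFLOW: run one workflow, or only print it when DRY_RUN=1.
run_index_step() {
    wf=$1
    set -- sos run pipeline/reference_data_preparation.ipynb "$wf" --cwd "$REF_DIR"
    # Of the three steps, only RSEM_index also takes the genome FASTA.
    if [ "$wf" = RSEM_index ]; then
        set -- "$@" --hg-reference "$FASTA"
    fi
    set -- "$@" --hg-gtf "$GTF"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_annotation_indices: run steps 5 to 7 in order, stopping at the first failure.
build_annotation_indices() {
    for step in RSEM_index RefFlat_generation SUPPA_annotation; do
        run_index_step "$step" || return 1
    done
}
```

Running `DRY_RUN=1 build_annotation_indices` first is a cheap way to confirm the paths before committing to the long-running jobs.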
8. Extract rsIDs for Known Variants#
Extract the rsIDs of known variants from the dbSNP VCF downloaded in step 1, so that genotype data can later be annotated with them. This optional step is implemented in the VCF_QC.ipynb module.
sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
--genoFile reference_data/00-All.vcf.gz
9. Generate Topologically Associated Domains#
TAD files and their boundaries are produced with the interactive generalized_TAD.ipynb notebook.
# interactive notebook
generalized_TAD.ipynb
10. Produce LD Blocks and Reference Panel#
LD blocks and the reference panel are produced with the interactive notebook_for_LD_block_reference_panel.ipynb notebook. See the linked documentation for the interactive walk-through.
Anticipated Results#
The pipeline uses the following reference data for RNA-seq expression quantification:
- GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}
- Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for the stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for the unstranded protocol.
- Everything under the STAR_Index folder.
- Everything under the RSEM_Index folder.
- Optionally, for quality control, gtf_ref.flat.
For alternative splicing: Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds (psichomics).
For topologically associated domains and boundary files: generalized_TAD.tsv, generalized_TADB.tsv, TADB_enhanced_cis.bed, and extended_TADB.bed.
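Before moving on to expression quantification, it may be worth confirming that the core RNA-seq references listed above are in place. The sketch below is a hypothetical check that assumes the files and the two index folders sit directly under the working directory; the GTF is left out because the file to expect depends on the stranded or unstranded protocol.

```shell
# report_missing DIR NAME...: print every NAME that does not exist under DIR
# (as a file or a directory) and return non-zero if any is missing.
report_missing() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $dir/$name"
            missing=1
        fi
    done
    return $missing
}

# check_rnaseq_references [DIR]: check the genome files and index folders
# named in the list above. DIR defaults to reference_data (an assumption).
check_rnaseq_references() {
    prefix=GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC
    report_missing "${1:-reference_data}" \
        "$prefix.dict" "$prefix.fasta" "$prefix.fasta.fai" \
        STAR_Index RSEM_Index
}
```

`report_missing` can be reused for the splicing and TAD outputs by passing their file names in the same way.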
Command interface#
List the workflows and parameters available in the reference data preparation module.
sos run pipeline/reference_data_preparation.ipynb -h