Reference Data#

This miniprotocol walks through how the reference data used throughout the xQTL protocol is downloaded, formatted, and indexed. It is an introductory, top-to-bottom guide: most steps call workflows from the reference_data_preparation.ipynb module. The exceptions are the rsID annotation step, which uses VCF_QC.ipynb, and the TAD and LD-block steps, which use two interactive notebooks.

Miniprotocol timing: ~4 hours, dominated by the genome download and the STAR/RSEM indexing steps.

Description#

The miniprotocol chains three modules, with the optional rsID annotation in step 8 delegated to VCF_QC.ipynb:

reference_data_preparation.ipynb (steps 1-7): download and format the reference genome, gene annotations, ERCC spike-in reference, and dbSNP variants, and build the STAR, RSEM, RefFlat, and SUPPA references.
generalized_TADB.ipynb (step 9): generate topologically associated domain (TAD) files and their boundaries.
notebook_for_LD_block_reference_panel.ipynb (step 10): produce LD blocks and the reference panel.

Input Files#

The download steps need only internet access; nothing local is required to begin. The later formatting and indexing steps consume the products of the preceding steps under the reference_data/ working directory.

File	Description
`GRCh38_full_analysis_set_plus_decoy_hla.fa`	Reference genome FASTA (downloaded)
`Homo_sapiens.GRCh38.103.chr.gtf`	Ensembl gene annotation GTF (downloaded)
`ERCC92.fa` / `ERCC92.gtf`	ERCC spike-in reference (downloaded)
`00-All.vcf.gz`	dbSNP variant reference (downloaded)

Steps#

1. Download Reference Data #

Download the human reference genome, the Ensembl gene annotation, the ERCC spike-in reference, and the dbSNP variant file into the reference_data directory.

Timing: varies by dataset.

sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd reference_data

Output Files#

File	Description
`reference_data/*.reference.fasta`	Formatted genome reference
`reference_data/*.gtf` (and gene feature files)	Formatted gene annotation
`reference_data/STAR_Index/`	STAR aligner index
`reference_data/RSEM_Index/`	RSEM quantification index
`reference_data/*.refFlat`	Picard-compatible refFlat annotation
`reference_data/*_suppa_annotation`	SUPPA annotation for psichomics
`reference_data/TAD/*.bed`	Topologically associated domain windows
`reference_data/LD_blocks/`	LD blocks and reference panel
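Before moving on to the formatting steps, it can help to confirm that the four download workflows left the expected files in place. The sketch below is a minimal check, not part of the pipeline: `check_downloads` is a hypothetical helper, and the file names are taken from the Input Files table on the assumption that the workflows write them unchanged into the directory passed as `--cwd`.

```shell
# check_downloads DIR: hypothetical helper (not part of the pipeline).
# Lists which of the files named in the Input Files table are absent
# or empty in DIR, and returns non-zero if any are missing.
check_downloads() {
    cd_dir="$1"
    cd_missing=0
    for cd_file in \
        GRCh38_full_analysis_set_plus_decoy_hla.fa \
        Homo_sapiens.GRCh38.103.chr.gtf \
        ERCC92.fa \
        ERCC92.gtf \
        00-All.vcf.gz
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$cd_dir/$cd_file" ]; then
            echo "missing: $cd_file"
            cd_missing=$((cd_missing + 1))
        fi
    done
    echo "$cd_missing file(s) missing"
    [ "$cd_missing" -eq 0 ]
}
```

A typical call would be `check_downloads reference_data`, run from the same directory as the `sos run` commands.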

2. Format Reference Data #

Merge the human genome with the ERCC spike-in sequences and standardise the FASTA so downstream tools share one consistent reference.

sos run pipeline/reference_data_preparation.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa
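One quick way to confirm that the merge worked is to count the sequence headers in the resulting FASTA. The sketch below assumes that the merged file is the `..._ERCC.fasta` file used in the later indexing steps and that the spike-in headers begin with `ERCC-`; `count_contigs` is a hypothetical helper, not a pipeline workflow.

```shell
# count_contigs FASTA PREFIX: hypothetical helper.
# Prints the number of FASTA header lines that start with ">PREFIX".
count_contigs() {
    awk -v p=">$2" 'index($0, p) == 1 { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_contigs reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta ERCC-` would be expected to report 92 if every ERCC92 sequence was appended, under the header-naming assumption above.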

3. Format Gene Feature Data #

Reformat the gene annotation (and append the ERCC features) to produce the collapsed and full GTFs used for expression quantification.

sos run pipeline/reference_data_preparation.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded

4. Generate STAR Index #

Build the STAR genome index used by the RNA-seq alignment step. This is compute- and memory-intensive.

sos run pipeline/reference_data_preparation.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta

5. Generate RSEM Index #

Build the RSEM index used for transcript-level expression quantification.

sos run pipeline/reference_data_preparation.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

6. Generate RefFlat Annotation for Picard #

Generate the RefFlat-format annotation used by Picard for RNA-seq QC metrics.

sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

7. Generate SUPPA Annotation for Psichomics #

Generate the SUPPA splicing-event annotation consumed by the psichomics splicing workflow.

sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
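Steps 6 and 7 take the same reformatted GTF and differ only in the workflow name, so they can be driven from one loop. The sketch below only prints the two commands for review rather than running them; `print_annotation_cmds` and the two variables are illustrative names, and the paths are the ones used in the commands above.

```shell
# Shared inputs for the RefFlat and SUPPA annotation steps.
ref_dir=reference_data
gtf="$ref_dir/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf"

# print_annotation_cmds: hypothetical helper that echoes one sos command
# per annotation workflow instead of executing it.
print_annotation_cmds() {
    for wf in RefFlat_generation SUPPA_annotation; do
        echo "sos run pipeline/reference_data_preparation.ipynb $wf --cwd $ref_dir --hg-gtf $gtf"
    done
}

print_annotation_cmds
```

Piping the output to `sh` would execute the commands in order; printing first makes it easy to spot a wrong path before a long-running job starts.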

8. Extract rsIDs for Known Variants #

Annotate the genotype VCF with dbSNP rsIDs. This optional step is implemented in the VCF_QC.ipynb module.

sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
    --genoFile reference_data/00-All.vcf.gz
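To see how many records in a compressed VCF carry an rsID, the ID column (the third tab-separated field) can be inspected directly. This is a minimal sketch: `count_rsids` is a hypothetical helper, and it assumes a gzip- or bgzip-compressed VCF in which rsIDs start with `rs`.

```shell
# count_rsids VCF_GZ: hypothetical helper.
# Counts non-header records whose ID column (field 3) begins with "rs".
count_rsids() {
    gzip -dc "$1" | awk -F '\t' '!/^#/ && $3 ~ /^rs/ { n++ } END { print n + 0 }'
}
```

Running it on a VCF before and after annotation gives a rough sense of how many variants were matched to dbSNP.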

9. Generate Topologically Associated Domains #

TAD files and their boundaries are produced with the interactive generalized_TADB.ipynb notebook.

# interactive notebook
generalized_TADB.ipynb

10. Produce LD Blocks and Reference Panel #

LD blocks and the reference panel are produced with the interactive notebook_for_LD_block_reference_panel.ipynb notebook. See the linked documentation for the interactive walk-through.

Anticipated Results#

The pipeline uses the following reference data for RNA-seq expression quantification:

GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}
Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for the stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for the unstranded protocol.
Everything under the STAR_Index folder.
Everything under the RSEM_Index folder.
Optionally, for quality control, gtf_ref.flat.
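The choice between the two gene-level GTFs depends only on the library protocol, so a downstream script can resolve it in one place. The sketch below encodes the mapping stated in the list above; `pick_gtf` is a hypothetical helper, not part of the pipeline.

```shell
# pick_gtf PROTOCOL: hypothetical helper.
# Prints the gene-level GTF name for "stranded" or "unstranded" data,
# following the mapping given in the list above.
pick_gtf() {
    case "$1" in
        stranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf" ;;
        unstranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf" ;;
        *)
            echo "unknown protocol: $1" >&2
            return 1 ;;
    esac
}
```

A script could then build the path with `gtf="reference_data/$(pick_gtf stranded)"`.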

For alternative splicing: Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds (psichomics).

For topologically associated domains and boundary files: generalized_TAD.tsv, generalized_TADB.tsv, TADB_enhanced_cis.bed, and extended_TADB.bed.

Command interface#

List the workflows and parameters available in the reference data preparation module.

sos run pipeline/reference_data_preparation.ipynb -h
