Reference Data#

This miniprotocol walks through how the reference data used throughout the xQTL protocol is downloaded, formatted, and indexed. It is an introductory, top-to-bottom guide: steps 1-7 call workflows from the reference_data_preparation.ipynb module, step 8 uses VCF_QC.ipynb for dbSNP, and steps 9-10 use two interactive notebooks for the TAD and LD-block references.

Miniprotocol timing: ~4 hours, dominated by the genome download and the STAR/RSEM indexing steps.

Description#

The miniprotocol chains four notebooks:

  1. reference_data_preparation.ipynb (steps 1-7): download and format the reference genome, gene annotations, ERCC spike-in reference, and dbSNP variants, then build the STAR / RSEM / RefFlat / SUPPA indices.

  2. VCF_QC.ipynb (step 8, optional): extract rsIDs for known variants from the dbSNP reference.

  3. generalized_TADB.ipynb (step 9): generate topologically associated domain (TAD) files and their boundaries.

  4. notebook_for_LD_block_reference_panel.ipynb (step 10): produce LD blocks and the reference panel.

Input Files#

The download steps need only internet access; nothing local is required to begin. The later formatting and indexing steps consume the products of the preceding steps under the reference_data/ working directory.

| File | Description |
| --- | --- |
| GRCh38_full_analysis_set_plus_decoy_hla.fa | Reference genome FASTA (downloaded) |
| Homo_sapiens.GRCh38.103.chr.gtf | Ensembl gene annotation GTF (downloaded) |
| ERCC92.fa / ERCC92.gtf | ERCC spike-in reference (downloaded) |
| 00-All.vcf.gz | dbSNP variant reference (downloaded) |

Steps#

1. Download Reference Data#

Download the human reference genome, the Ensembl gene annotation, the ERCC spike-in reference, and the dbSNP variant file into the reference_data directory.

Timing: varies by dataset on typical compute infrastructure.

Output Files#

| File | Description |
| --- | --- |
| reference_data/*.reference.fasta | Formatted genome reference |
| reference_data/*.gtf (and gene feature files) | Formatted gene annotation |
| reference_data/STAR_Index/ | STAR aligner index |
| reference_data/RSEM_Index/ | RSEM quantification index |
| reference_data/*.refFlat | Picard-compatible refFlat annotation |
| reference_data/*_suppa_annotation | SUPPA annotation for psichomics |
| reference_data/TAD/*.bed | Topologically associated domain windows |
| reference_data/LD_blocks/ | LD blocks and reference panel |

sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd reference_data
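
Before moving on to the formatting steps, it can help to confirm that the downloads actually landed. The following is a minimal sketch, not part of the pipeline: the function name check_reference_downloads is hypothetical, and it assumes the downloaded files carry exactly the names listed under Input Files and sit directly in the directory passed as --cwd.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the xQTL pipeline): report which of the
# expected downloads are missing or empty in a given directory.
check_reference_downloads() {
    local dir="${1:-reference_data}"
    local missing=0
    local f
    for f in \
        GRCh38_full_analysis_set_plus_decoy_hla.fa \
        Homo_sapiens.GRCh38.103.chr.gtf \
        ERCC92.fa \
        ERCC92.gtf \
        00-All.vcf.gz
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected file was found
    [ "${missing}" -eq 0 ]
}
```

Calling check_reference_downloads with the directory used as --cwd prints one line per missing file and returns a non-zero status if anything is absent, so it can gate the later steps in a script.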

2. Format Reference Data#

Merge the human genome with the ERCC spike-in sequences and standardise the FASTA so downstream tools share one consistent reference.

sos run pipeline/reference_data_preparation.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa
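
One quick sanity check on the merged reference is to count its FASTA records and confirm that the spike-in sequences are present. The sketch below is illustrative only: count_fasta_records is a hypothetical helper, and it assumes the spike-in records keep header names beginning with ERCC- after merging.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count FASTA records, optionally restricted to
# headers that start with a given prefix (for example "ERCC-").
count_fasta_records() {
    local fasta="$1"
    local prefix="${2:-}"
    # grep -c prints 0 but exits with status 1 when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c -- "^>${prefix}" "${fasta}" || true
}
```

For example, count_fasta_records reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta ERCC- should report a non-zero count if the merge included the spike-ins.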

3. Format Gene Feature Data#

Reformat the gene annotation (and append the ERCC features) to produce the collapsed and full GTFs used for expression quantification.

sos run pipeline/reference_data_preparation.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded

4. Generate STAR Index#

Build the STAR genome index used by the RNA-seq alignment step. This is compute- and memory-intensive.

sos run pipeline/reference_data_preparation.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta

5. Generate RSEM Index#

Build the RSEM index used for transcript-level expression quantification.

sos run pipeline/reference_data_preparation.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

6. Generate RefFlat Annotation for Picard#

Generate the RefFlat-format annotation used by Picard for RNA-seq QC metrics.

sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

7. Generate SUPPA Annotation for Psichomics#

Generate the SUPPA splicing-event annotation consumed by the psichomics splicing workflow.

sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
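
Steps 4-7 all consume the same formatted FASTA and GTF, so they can be wrapped in a small driver. This is a sketch under stated assumptions rather than a supported entry point: run_cmd, build_indices, and the DRY_RUN variable are hypothetical names, and the commands simply mirror the ones shown in steps 4-7. By default the driver only prints what it would run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default here),
# otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "[dry-run] $*"
    else
        "$@"
    fi
}

# Hypothetical driver for steps 4-7; stops at the first failing step.
build_indices() {
    local cwd="${1:-reference_data}"
    local nb="pipeline/reference_data_preparation.ipynb"
    local fasta="${cwd}/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta"
    local gtf="${cwd}/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf"

    run_cmd sos run "${nb}" STAR_index --cwd "${cwd}" --hg-reference "${fasta}" || return 1
    run_cmd sos run "${nb}" RSEM_index --cwd "${cwd}" --hg-reference "${fasta}" --hg-gtf "${gtf}" || return 1
    run_cmd sos run "${nb}" RefFlat_generation --cwd "${cwd}" --hg-gtf "${gtf}" || return 1
    run_cmd sos run "${nb}" SUPPA_annotation --cwd "${cwd}" --hg-gtf "${gtf}" || return 1
}
```

Running build_indices reference_data prints the four commands for review; setting DRY_RUN=0 first executes them in order.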

8. Extract rsIDs for Known Variants#

Extract the rsIDs of known variants from the dbSNP reference so that they can be used to annotate genotype VCFs. This optional step is implemented in the VCF_QC.ipynb module.

sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
    --genoFile reference_data/00-All.vcf.gz
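
To see what this step has to work with, the first few variant records of the dbSNP file can be previewed with standard tools. The helper below is a hypothetical sketch (preview_vcf_ids is not a pipeline function); it relies only on the standard VCF column order, in which the first five columns are CHROM, POS, ID, REF, and ALT.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print CHROM, POS, ID, REF, ALT for the first N
# variant records of a gzip-compressed VCF (default N=5).
preview_vcf_ids() {
    local vcf_gz="$1"
    local n="${2:-5}"
    gzip -dc "${vcf_gz}" | awk -F'\t' -v n="${n}" '
        /^#/ { next }                      # skip meta and header lines
        {
            print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5
            if (++shown >= n) exit         # stop early; the file is large
        }
    '
}
```

For example, preview_vcf_ids reference_data/00-All.vcf.gz 3 prints three tab-separated records whose third column holds the rsID.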

9. Generate Topologically Associated Domains#

TAD files and their boundaries are produced with the interactive generalized_TADB.ipynb notebook.

# interactive notebook
generalized_TADB.ipynb

10. Produce LD Blocks and Reference Panel#

LD blocks and the reference panel are produced with the interactive notebook_for_LD_block_reference_panel.ipynb notebook. See the linked documentation for the interactive walk-through.

Anticipated Results#

The pipeline uses the following reference data for RNA-seq expression quantification:

  1. GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}

  2. Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for the stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for the unstranded protocol.

  3. Everything under the STAR_Index folder.

  4. Everything under the RSEM_Index folder.

  5. Optionally, for quality control, gtf_ref.flat.
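
The choice of GTF in item 2 depends on the library protocol, and in a script it can be written as a small lookup. This is only a sketch: quant_gtf_for_protocol is a hypothetical helper name, and the two file names are taken directly from the list above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map the library protocol to the GTF named in the
# list of anticipated results.
quant_gtf_for_protocol() {
    case "$1" in
        stranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf"
            ;;
        unstranded)
            echo "Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf"
            ;;
        *)
            echo "unknown protocol: $1 (expected stranded or unstranded)" >&2
            return 1
            ;;
    esac
}
```

A downstream script could then build the path as reference_data/$(quant_gtf_for_protocol stranded) instead of hard-coding one of the two names.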

For alternative splicing: Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds (psichomics).

For topologically associated domains and boundary files: generalized_TAD.tsv, generalized_TADB.tsv, TADB_enhanced_cis.bed, and extended_TADB.bed.

Command interface#

List the workflows and parameters available in the reference data preparation module.

sos run pipeline/reference_data_preparation.ipynb -h