Reference Data#

Miniprotocol Timing#

Timing ~4 hours

Overview#

This miniprotocol uses the following notebooks to download, index, and preprocess the reference data used throughout the pipeline:

reference_data_preparation.ipynb (steps i-vii): download and format reference files
VCF_QC.ipynb (step viii): extract rsIDs for known variants
generalized_TADB.ipynb (step ix): generate topologically associated domain files and their boundaries
notebook_for_LD_block_reference_panel.ipynb (step x): produce LD blocks and the reference panel

Steps#

i. Download Reference Data #

sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd reference_data
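
Before moving on, it can help to confirm that the downloads produced the files that the later steps read. The following is a minimal sketch, not part of the pipeline: the helper name `check_downloads` is hypothetical, and the file list is an assumption taken from the paths passed to steps ii, iii and viii below.

```shell
# Sketch: report which input files for steps ii, iii and viii are missing.
# check_downloads is a hypothetical helper; the file names are copied from
# the commands in this protocol and assumed to sit directly in the --cwd folder.
check_downloads() {
    dir="${1:-reference_data}"
    missing=0
    for f in \
        GRCh38_full_analysis_set_plus_decoy_hla.fa \
        ERCC92.fa \
        ERCC92.gtf \
        Homo_sapiens.GRCh38.103.chr.gtf \
        00-All.vcf.gz
    do
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found
    [ "$missing" -eq 0 ]
}
```

Calling `check_downloads reference_data` prints one line per absent file and returns a non-zero status if any are missing, so it can gate the formatting steps in a script.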

ii. Format Reference Data #

sos run pipeline/reference_data_preparation.ipynb hg_reference \
    --cwd reference_data \
    --ercc-reference reference_data/ERCC92.fa \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa 

iii. Format Gene Feature Data #

sos run pipeline/reference_data_preparation.ipynb gene_annotation \
    --cwd reference_data \
    --ercc-gtf reference_data/ERCC92.gtf \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded 

iv. Generate STAR Index #

sos run pipeline/reference_data_preparation.ipynb STAR_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta 

v. Generate RSEM Index #

sos run pipeline/reference_data_preparation.ipynb RSEM_index \
    --cwd reference_data \
    --hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf 

vi. Generate RefFlat Annotation for Picard #

sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf 

vii. Generate SUPPA Annotation for Psichomics #

sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
    --cwd reference_data \
    --hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
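
Steps ii to vii repeat the same long file names, and each step reads files written by an earlier one. The sketch below shows one way to build every path from two shared prefixes and to run the steps in order, stopping at the first failure. The function names `run` and `run_steps` and the `DRY_RUN` variable are hypothetical; the workflow names and options are the ones used above, with the SUPPA option written as `--hg-gtf` on the assumption that it is spelled the same way as in the other steps.

```shell
# Sketch: run steps ii-vii in order with paths derived from shared prefixes.
# run, run_steps and DRY_RUN are illustrative names, not pipeline features.
CWD=reference_data
NB=pipeline/reference_data_preparation.ipynb
HG="$CWD/GRCh38_full_analysis_set_plus_decoy_hla"
GTF="$CWD/Homo_sapiens.GRCh38.103.chr"

# With DRY_RUN=1 (the default here) each command is printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

run_steps() {
    run sos run "$NB" hg_reference --cwd "$CWD" \
        --ercc-reference "$CWD/ERCC92.fa" --hg-reference "$HG.fa" || return 1
    run sos run "$NB" gene_annotation --cwd "$CWD" \
        --ercc-gtf "$CWD/ERCC92.gtf" --hg-gtf "$GTF.gtf" \
        --hg-reference "$HG.noALT_noHLA_noDecoy.fasta" --stranded || return 1
    run sos run "$NB" STAR_index --cwd "$CWD" \
        --hg-reference "$HG.noALT_noHLA_noDecoy_ERCC.fasta" || return 1
    run sos run "$NB" RSEM_index --cwd "$CWD" \
        --hg-reference "$HG.noALT_noHLA_noDecoy_ERCC.fasta" \
        --hg-gtf "$GTF.reformatted.ERCC.gtf" || return 1
    run sos run "$NB" RefFlat_generation --cwd "$CWD" \
        --hg-gtf "$GTF.reformatted.ERCC.gtf" || return 1
    run sos run "$NB" SUPPA_annotation --cwd "$CWD" \
        --hg-gtf "$GTF.reformatted.ERCC.gtf" || return 1
}
```

Running `run_steps` prints the six commands for review; `DRY_RUN=0 run_steps` would execute them, which requires `sos` and the downloaded files from step i.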

viii. Extract rsIDs for known variants #

sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
    --genoFile reference_data/00-All.vcf.gz 
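
Because the dbSNP file is large, a truncated download is easy to miss. As a hedged sketch, the hypothetical helper `check_vcf_gz` below tests the compressed stream and checks that the first line is the `##fileformat=VCF` header line that the VCF format requires; it does not validate the records themselves.

```shell
# Sketch: basic integrity check for a compressed VCF such as 00-All.vcf.gz.
# check_vcf_gz is a hypothetical helper, not part of the pipeline.
check_vcf_gz() {
    vcf="$1"
    # gzip -t reads the whole file and fails on truncated or corrupt streams
    gzip -t "$vcf" 2>/dev/null || { echo "not a valid gzip stream: $vcf"; return 1; }
    # A VCF file must begin with a ##fileformat=VCF... line
    gzip -dc "$vcf" 2>/dev/null | head -n 1 | grep -q '^##fileformat=VCF' \
        || { echo "missing VCF header line: $vcf"; return 1; }
}
```

For example, `check_vcf_gz reference_data/00-All.vcf.gz` returns zero only if both checks pass.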

ix. Generation of topologically associated domains and their boundaries #

# interactive notebook
generalized_TADB.ipynb

x. Production of LD blocks and reference panel #

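Commands for this step are not yet documented; the workflow is in notebook_for_LD_block_reference_panel.ipynb.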

Anticipated Results#

Our pipeline uses the following reference data for RNA-seq expression quantification:

GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}
Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for the stranded protocol, or Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for the unstranded protocol
Everything under the STAR_Index folder
Everything under the RSEM_Index folder
Optionally, for quality control, gtf_ref.flat
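
The list above can be turned into a quick completeness check. This is a minimal sketch under stated assumptions: the helper name `check_rnaseq_refs` is hypothetical, all outputs are assumed to sit directly under the folder given as `--cwd`, and an index folder is treated as present when it contains at least one entry.

```shell
# Sketch: verify that the RNA-seq reference outputs listed above exist.
# check_rnaseq_refs is a hypothetical helper; the layout under the --cwd
# folder is an assumption, not something the pipeline guarantees.
check_rnaseq_refs() {
    dir="${1:-reference_data}"
    prefix=GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC
    gtf=Homo_sapiens.GRCh38.103.chr.reformatted
    status=0
    for f in "$prefix.dict" "$prefix.fasta" "$prefix.fasta.fai"; do
        [ -s "$dir/$f" ] || { echo "missing: $dir/$f"; status=1; }
    done
    # Only one gene-level GTF is needed, depending on the library protocol
    if [ ! -s "$dir/$gtf.collapse_only.gene.ERCC.gtf" ] && [ ! -s "$dir/$gtf.gene.ERCC.gtf" ]; then
        echo "missing: no gene-level GTF found in $dir"
        status=1
    fi
    for d in STAR_Index RSEM_Index; do
        if [ ! -d "$dir/$d" ] || [ -z "$(ls -A "$dir/$d")" ]; then
            echo "missing or empty folder: $dir/$d"
            status=1
        fi
    done
    # gtf_ref.flat is optional, so its absence is reported but not an error
    [ -s "$dir/gtf_ref.flat" ] || echo "note: optional $dir/gtf_ref.flat not found"
    return "$status"
}
```

`check_rnaseq_refs reference_data` prints one line per missing item and returns a non-zero status if any required file or index folder is absent.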

The following reference files are used for methylation:

To be added by Alexandre

The following reference files are used for alternative splicing:

Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds for psichomics.

The following reference files describe topologically associated domains and their boundaries:

generalized_TAD.tsv
generalized_TADB.tsv
TADB_enhanced_cis.bed
extended_TADB.bed
