Reference Data#
Miniprotocol Timing#
Timing ~4 hours
Overview#
This miniprotocol shows the use of various modules to download, index and preprocess reference data for use throughout the pipeline. The modules are as follows:
reference_data_preparation.ipynb (steps i-viii): Download and format reference files
generalized_TADB.ipynb (step ix): generate topologically associated domain files and their boundaries
notebook_for_LD_block_reference_panel.ipynb (step x): production of LD blocks and reference panel
Steps#
i. Download Reference Data#
sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd reference_data
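The four download commands differ only in the workflow name, so they can be generated in a loop. The following is a minimal sketch, not part of the pipeline: `build_download_cmd` is a hypothetical helper introduced here, and the loop only prints the commands (a dry run) rather than executing them.

```shell
# build_download_cmd is a hypothetical helper for this sketch.
# It prints the sos command for one download workflow ($1) and working directory ($2).
build_download_cmd() {
  printf 'sos run pipeline/reference_data_preparation.ipynb %s --cwd %s\n' "$1" "$2"
}

# Dry run over the four workflow names listed in step i.
for step in download_hg_reference download_gene_annotation \
            download_ercc_reference download_dbsnp; do
  build_download_cmd "$step" reference_data
done
```

Printing first makes it easy to review the commands; piping the output to `sh` would then run them one after another.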
ii. Format Reference Data#
sos run pipeline/reference_data_preparation.ipynb hg_reference \
--cwd reference_data \
--ercc-reference reference_data/ERCC92.fa \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa
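Step ii reads two files produced by step i, so it can help to confirm they are present and non-empty before formatting. A minimal sketch follows; `missing_files` is a hypothetical helper, and the paths are the ones passed to `--ercc-reference` and `--hg-reference` above, assuming the command is run from the directory that contains `reference_data/`.

```shell
# missing_files is a hypothetical helper for this sketch.
# It prints each argument that is absent or empty and returns non-zero if any was printed.
missing_files() {
  rc=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      printf 'missing or empty: %s\n' "$f"
      rc=1
    fi
  done
  return "$rc"
}

# Inputs named in step ii.
missing_files \
  reference_data/ERCC92.fa \
  reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  || echo "Re-run step i before formatting the reference."
```

The same helper can be pointed at the inputs of any later step, for example the `.fasta` and `.gtf` files passed to the index-building commands.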
iii. Format Gene Feature Data#
sos run pipeline/reference_data_preparation.ipynb gene_annotation \
--cwd reference_data \
--ercc-gtf reference_data/ERCC92.gtf \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
--stranded
iv. Generate STAR Index#
sos run pipeline/reference_data_preparation.ipynb STAR_index \
--cwd reference_data \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta
v. Generate RSEM Index#
sos run pipeline/reference_data_preparation.ipynb RSEM_index \
--cwd reference_data \
--hg-reference reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
vi. Generate RefFlat Annotation for Picard#
sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
--cwd reference_data \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
vii. Generate SUPPA Annotation for Psichomics#
sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
--cwd reference_data \
--hg-gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf
viii. Extract rsIDs for known variants#
sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
--genoFile reference_data/00-All.vcf.gz
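Because the dbSNP download is large, a quick sanity check of the file before annotation can save time. The sketch below relies only on the fact that a VCF file begins with a `##fileformat=VCF...` line; `looks_like_vcf_gz` is a hypothetical helper, not part of `VCF_QC.ipynb`.

```shell
# looks_like_vcf_gz is a hypothetical helper for this sketch.
# It returns 0 if the first decompressed line of $1 is a VCF fileformat header.
looks_like_vcf_gz() {
  gzip -cd "$1" 2>/dev/null | head -n 1 | grep -q '^##fileformat=VCF'
}

if looks_like_vcf_gz reference_data/00-All.vcf.gz; then
  echo "dbSNP file looks like a compressed VCF"
else
  echo "reference_data/00-All.vcf.gz is missing, unreadable, or not a VCF"
fi
```

This only inspects the first line, so it does not detect a download that was cut off partway through; `gzip -t` on the whole file is a slower but more thorough check.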
ix. Generation of topologically associated domains and their boundaries#
# interactive notebook
generalized_TAD.ipynb
x. Production of LD blocks and reference panel#
FIXME
Anticipated Results#
Our pipeline uses the following reference data for RNA-seq expression quantification:
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.{dict,fasta,fasta.fai}
Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf for stranded protocol, and Homo_sapiens.GRCh38.103.chr.reformatted.gene.ERCC.gtf for unstranded protocol.
Everything under the STAR_Index folder
Everything under the RSEM_Index folder
Optionally, for quality control, gtf_ref.flat
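The list of RNA-seq reference outputs can be turned into a quick presence check. This is a sketch under one stated assumption: that the outputs sit directly under the `--cwd` directory used in the steps (`reference_data`). `check_rnaseq_refs` is a hypothetical helper; it leaves out the GTF file because the expected name depends on the protocol.

```shell
# check_rnaseq_refs is a hypothetical helper for this sketch.
# Assumption: outputs sit directly under the directory given as $1.
check_rnaseq_refs() {
  dir="$1"
  prefix="GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC"
  rc=0
  # The three files covered by the {dict,fasta,fasta.fai} pattern.
  for f in "$prefix.dict" "$prefix.fasta" "$prefix.fasta.fai"; do
    [ -s "$dir/$f" ] || { echo "missing file: $dir/$f"; rc=1; }
  done
  # An index folder counts as present only if it contains at least one entry.
  for d in STAR_Index RSEM_Index; do
    if [ ! -d "$dir/$d" ] || [ -z "$(ls -A "$dir/$d")" ]; then
      echo "missing or empty folder: $dir/$d"
      rc=1
    fi
  done
  return "$rc"
}

check_rnaseq_refs reference_data || echo "Some reference outputs are not in place yet."
```

The check confirms that files exist, not that they are complete or correct.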
The following reference files are used for methylation:
To be added by Alexandre
The following reference files are used for alternative splicing:
Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.SUPPA_annotation.rds for psichomics.
The following reference files describe topologically associated domains and their boundaries:
generalized_TAD.tsv
generalized_TADB.tsv
TADB_enhanced_cis.bed
extended_TADB.bed
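The two `.bed` outputs can be given a basic structural check. The sketch below assumes they follow the usual BED convention (chromosome, integer start, integer end in the first three columns), which the source does not state explicitly; `check_bed` is a hypothetical helper, and the bare file names should be adjusted to wherever the notebook writes them.

```shell
# check_bed is a hypothetical helper for this sketch.
# It prints the line number of every record whose first three columns are not
# chrom, integer start, integer end with start <= end; non-zero exit if any is found.
check_bed() {
  awk '
    /^#/ || /^track/ || /^browser/ { next }   # skip comment and header lines
    NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
      printf "line %d is not a valid BED record\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Assumed locations: adjust the paths as needed.
for bed in TADB_enhanced_cis.bed extended_TADB.bed; do
  if [ -f "$bed" ]; then
    check_bed "$bed" || echo "check $bed"
  fi
done
```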