Reference Data Standardization

Reference Data Standardization#

Description#

We follow the TOPMed workflow from Broad for the aquisition and preparation of reference files for use throughout our pipeline. Our processed reference data may be found in this folder on Synapse. We have included the PDF document compiled by Data Standardization Working Group in the on Synapse as well as on ADSP Dashboard. It contains the reference data to use for the project.

Command interface#

The pipeline below is organized as a sequence of standalone steps, each with its own narrative and sos run command. All steps write into a shared output/reference_data working directory so that later steps can consume the artefacts produced by earlier ones.

1. Download reference data#

This step downloads the unmodified source files that every downstream step builds upon: the human reference genome, the Ensembl gene annotation, the ERCC spike-in sequences, and the dbSNP variant call set.

Input: None.

Output:

00-All.vcf.gz
00-All.vcf.gz.tbi
ERCC92.fa
ERCC92.gtf
ERCC92.zip
GRCh38_full_analysis_set_plus_decoy_hla.fa
Homo_sapiens.GRCh38.103.chr.gtf
Homo_sapiens.GRCh38.103.chr.gtf.gz

sos run pipeline/reference_data_preparation.ipynb download_hg_reference --cwd output/reference_data
sos run pipeline/reference_data_preparation.ipynb download_gene_annotation --cwd output/reference_data
sos run pipeline/reference_data_preparation.ipynb download_ercc_reference --cwd output/reference_data
sos run pipeline/reference_data_preparation.ipynb download_dbsnp --cwd output/reference_data

2. Format reference genome#

The downloaded genome is reformatted into the version used by the rest of the protocol: the HLA, ALT and Decoy records are removed because none of the downstream RNA-seq calling components handle them properly, the ERCC spike-in sequences are appended (these do no harm even when ERCC is not part of the RNA-seq library), and the resulting FASTA is indexed.

Input: a reference FASTA file.

Output: a reference FASTA file without HLA, ALT and Decoy but with ERCC:

ERCC92.patched.fa
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.dict
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta
GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta.fai
Homo_sapiens.GRCh38.103.chr.reformatted.gtf

sos run pipeline/reference_data_preparation.ipynb hg_reference \
    --cwd output/reference_data \
    --ercc-reference output/reference_data/ERCC92.fa \
    --hg-reference output/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.fa 

3. Transcript and gene model reference processing#

This step modifies the GTF file for two reasons. First, RSEM requires the GTF to use the same chromosome naming convention (with a chr prefix) as the FASTA file; although this can also be addressed with the --sjdbGTFchrPrefix "chr" option for STAR, the chr prefix is added to the GTF for use with RSEM. Second, the gene-model collapsing script collapse_annotation.py from GTEx expects the GTF to carry a transcript_type attribute rather than transcript_biotype; this is renamed here while building the Docker image. ERCC information is also added to the GTF reference as a further customization.

Collapsed gene model. Gene-level expression and eQTLs in the GTEx project are computed on a collapsed gene model (all isoforms of a gene combined into a single transcript), following these rules: transcripts annotated as retained_intron or read_through are excluded, and transcripts overlapping annotated read-through transcripts may be blacklisted (blacklists for GENCODE v19, v24 and v25 are provided in the GTEx repository; no transcripts were blacklisted for v26); the union of all exon intervals of each gene is taken; and overlapping intervals between genes are excluded. Excluding overlapping regions (step 3) primarily removes genes annotated on both strands that cannot be unambiguously quantified from unstranded RNA-seq; for stranded protocols this step can be skipped with the --collapse_only flag. Further documentation is available on the GTEx Portal.

Input: an uncompressed GFF3 file (i.e. one that can be read with cat) and a GTF file.

Output: a GTF file with a collapsed gene model.

sos run pipeline/reference_data_preparation.ipynb hg_gtf \
    --cwd output/reference_data \
    --hg-gtf output/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference output/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded 

4. Gene annotation#

This step assembles the gene annotation used downstream by combining the ERCC GTF, the reformatted Ensembl GTF, and the formatted reference genome into a single coherent annotation set.

Input: a GTF file and an accompanying FASTA file.

Output: the formatted gene annotation reference.

sos run pipeline/reference_data_preparation.ipynb gene_annotation \
    --cwd output/reference_data \
    --ercc-gtf output/reference_data/ERCC92.gtf \
    --hg-gtf output/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --hg-reference output/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \
    --stranded 

5. Generate STAR index#

This step generates the index file used for STAR alignment. The index needs to be generated only once and can then be re-used. At least 40GB of memory is required.

The gtf and fasta parameters give the paths to the reference sequence and both must be unzipped; the gtf should be the one prior to collapsing by gene. The sjdbOverhang parameter specifies the length of genomic sequence around each annotated junction used when constructing the splice-junction database; ideally it should equal the read length minus one, where read length is the length of the reads. A value of 100 is used here as recommended by the TOPMed pipeline (see additional discussion).

Input: a GTF file and an accompanying FASTA file.

Output: an index folder stored in {cwd}/STAR_Index, used by STAR.

sos run pipeline/reference_data_preparation.ipynb STAR_index \
    --cwd output/reference_data \
    --hg-reference output/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --numThreads 10 \
    --mem 40G

6. Generate RSEM index#

This step generates the indexing file used for RSEM. The index needs to be generated only once.

The gtf and fasta parameters give the paths to the reference sequence; the gtf should be the one prior to collapsing by gene. The sjdbOverhang parameter specifies the length of genomic sequence around each annotated junction used when constructing the splice-junction database; ideally it should equal the read length minus one.

Input: a GTF file and an accompanying FASTA file.

Output: an index folder stored in RSEM_Index, used by RSEM.

sos run pipeline/reference_data_preparation.ipynb RSEM_index \
    --cwd output/reference_data \
    --hg-reference output/reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy_ERCC.fasta \
    --hg-gtf output/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

7. Generate RefFlat annotation#

This file is required by the Picard CollectRnaSeqMetrics module, which produces metrics describing the distribution of bases within the transcripts. It calculates the total numbers and fractions of nucleotides within specific genomic regions, including untranslated regions (UTRs), introns, intergenic sequences (between discrete genes), and peptide-coding sequences (exons). The tool also determines the number of bases that pass quality filters specific to Illumina data (PF_BASES).

Input: a GTF file.

Output: a ref.flat file.

sos run pipeline/reference_data_preparation.ipynb RefFlat_generation \
    --cwd output/reference_data \
    --hg-gtf output/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf

8. Generate SUPPA annotation#

The custom alternative-splicing annotation is generated for psichomics, based on this tutorial; the procedure for generating the local alternative-splicing SUPPA output is documented on the SUPPA GitHub page. Because the original annotation provided by the psichomics package uses gene symbols, it is modified here to use Ensembl IDs. The modified annotation is then used for detecting RNA alternative splicing with psichomics.

Input: a GTF file.

Output: a .SUPPA_annotation.rds file.

sos run pipeline/reference_data_preparation.ipynb SUPPA_annotation \
    --cwd output/reference_data \
    --hg_gtf output/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.ERCC.gtf 

9. Extract rsIDs and TAD annotation#

The dbSNP call set is processed to extract rsIDs for known variants, producing a lookup used to annotate variants throughout the protocol. The TAD-associated index is built from the topologically associating domains defined in Wang et al. to delimit the genotype cis region used for fine-mapping across phenotypes; the TAD dataframe for each cell type and tissue is downloaded from the hg38 TAD resource. The TAD file is generated in three steps: the union of TADs across tissues, extension with TAD boundaries, and extension with cis windows.

Input: a .vcf.gz file (for rsIDs) and a file containing a list of TADs and a region list (for TAD annotation).

Output: a .variants.gz file (rsIDs) and a .annotation file (TAD).

sos run pipeline/VCF_QC.ipynb dbsnp_annotate \
    --genoFile output/reference_data/00-All.vcf.gz \
    --cwd output/reference_data 

Output#

The steps above write a standardized reference bundle that the rest of the protocol consumes:

File	Produced by
Cleaned reference genome (FASTA) and matched gene model (GTF)	Steps 1–2 (download & format reference data)
Per-feature gene-coordinate table	Step 3 (format gene feature data)
STAR genome index	Step 4 (generate STAR index)
RSEM reference	Step 5 (generate RSEM index)
RefFlat annotation (for Picard RNA-seq metrics)	Step 6
SUPPA event annotation (for psichomics splicing)	Step 7
rsID lookup for known variants	Step 8 (extract rsIDs)

Anticipated Results#

Running all eight steps produces a complete, standardized reference-data bundle for the rest of the protocol: a cleaned reference genome and matched gene-model annotation (GTF), per-feature gene-coordinate tables, a STAR genome index and an RSEM reference for RNA-seq alignment/quantification, a RefFlat annotation for Picard RNA-seq metrics, a SUPPA event annotation for psichomics-based splicing analysis, and an rsID lookup for known variants. These files are written once and then reused as fixed inputs by the molecular-phenotype, association, and downstream modules, so that every analysis in the protocol is run against the same curated reference.

Workflow implementation#

The SoS step definitions below implement the commands described above.

[global]
# The output directory for generated files.
parameter: cwd = path("output")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
cwd = path(f'{cwd:a}')
from sos.utils import expand_size

[download_hg_reference]
output: f"{cwd:a}/GRCh38_full_analysis_set_plus_decoy_hla.fa"
download: dest_dir = cwd
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

[download_gene_annotation]
output: f"{cwd:a}/Homo_sapiens.GRCh38.103.chr.gtf"
download: dest_dir = cwd, decompress=True
    http://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.chr.gtf.gz

[download_ercc_reference]
output: f"{cwd:a}/ERCC92.gtf", f"{cwd:a}/ERCC92.fa"
download: dest_dir = cwd, decompress=True
    https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC92.zip

[download_dbsnp]
output: f"{cwd:a}/00-All.vcf.gz", f"{cwd:a}/00-All.vcf.gz.tbi"
download: dest_dir = cwd
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz
    ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz.tbi

[gff3_to_gtf]
parameter: gff3_file = path
input: gff3_file
output: f'{cwd}/{_input:bn}.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
        gffread ${_input} -T -o ${_output}

[hg_reference_1 (HLA ALT Decoy removal)]
# Path to HG reference file
parameter: hg_reference = path
input: hg_reference
output:  f'{cwd}/{_input:bn}.noALT_noHLA_noDecoy.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}', 'r') as fasta:
        contigs = fasta.read()
        contigs = contigs.split('>')
        contig_ids = [i.split(' ', 1)[0] for i in contigs]

        # exclude ALT, HLA and decoy contigs
        filtered_fasta = '>'.join([c for i,c in zip(contig_ids, contigs)
        if not (i[-4:]=='_alt' or i[:3]=='HLA' or i[-6:]=='_decoy')])
    
    with open('${_output}', 'w') as fasta:
        fasta.write(filtered_fasta)

[hg_reference_2 (merge with ERCC reference)]
parameter: ercc_reference = path
output: f'{cwd}/{_input:bn}_ERCC.fasta'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output}.stdout', container = container
    sed 's/ERCC-/ERCC_/g' ${ercc_reference} > ${ercc_reference:n}.patched.fa
    cat ${_input} ${ercc_reference:n}.patched.fa > ${_output}

[hg_reference_3 (index the fasta file)]
output: f'{cwd}/{_input:bn}.dict'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    samtools faidx ${_input}
    picard CreateSequenceDictionary \
        R=${_input} \
        O=${_output}

[hg_gtf_1 (add chr prefix to gtf file)]
parameter: hg_reference = path
parameter: hg_gtf = path
input: hg_reference, hg_gtf
output: f'{cwd}/{_input[1]:bn}.reformatted.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
R: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    library("readr")
    library("stringr")
    library("dplyr")
    options(scipen = 999)
    con <- file("${_input[0]}","r")
    fasta <- readLines(con,n=1)
    close(con)
    gtf = read_delim("${_input[1]}", "\t",  col_names  = F, comment = "#", col_types="ccccccccc")
    if(!str_detect(fasta,">chr")) {
        gtf_mod = gtf%>%mutate(X1 = str_remove_all(X1,"chr"))
    } else if (!any(str_detect(gtf$X1[1],"chr"))) {
        gtf_mod = gtf%>%mutate(X1 = paste0("chr",X1))
    } else (gtf_mod = gtf)
    if(any(str_detect(gtf_mod$X9, "transcript_biotype"))) {
      gtf_mod = gtf_mod%>%mutate(X9 = str_replace_all(X9,"transcript_biotype","transcript_type"))
    }
    gtf_mod%>%write.table("${_output}",sep = "\t",quote = FALSE,col.names = F,row.names = F)

[hg_gtf_2 (collapsed gene model)]
parameter: stranded = bool
output: f'{_input:n}{".collapse_only" if stranded else ""}.gene.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    collapse_annotation.py ${"--collapse_only" if stranded else ""} ${_input} ${_output}

[ercc_gtf (Preprocess ERCC gtf file)]
parameter: ercc_gtf = path
input: ercc_gtf
output: f'{cwd}/{_input:bn}.genes.patched.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
python: expand = "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    with open('${_input}') as exon_gtf, open('${_output}', 'w') as gene_gtf:
        for line in exon_gtf:
            f = line.strip().split('\t')
            f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)
        
            attr = f[8]
            if attr[-1]==';':
                attr = attr[:-1]
            attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
            # add gene_name, gene_type
            attr['gene_name'] = attr['gene_id']
            attr['gene_type'] = 'ercc_control'
            attr['gene_status'] = 'KNOWN'
            attr['level'] = 2
            for k in ['id', 'type', 'name', 'status']:
                attr['transcript_'+k] = attr['gene_'+k]
        
            attr_str = []
            for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
                'transcript_type', 'transcript_status', 'transcript_name']:
                attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
            attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
            f[8] = ' '.join(attr_str)
        
            # write gene, transcript, exon
            gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
            gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
            f[8] = ' '.join(attr_str[:2])
            gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')

[gene_annotation]
input: output_from("hg_gtf_1"), output_from("hg_gtf_2"), output_from("ercc_gtf")
output: f'{cwd}/{_input[0]:bn}.ERCC.gtf', f'{cwd}/{_input[1]:bn}.ERCC.gtf'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output:bn}'
bash: expand = "${ }", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout', container = container
    cat ${_input[0]} ${_input[2]} > ${_output[0]}
    cat ${_input[1]} ${_input[2]} > ${_output[1]}

[STAR_index]
parameter: hg_reference = path
# Specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads.
# Default choice follows from TOPMed pipeline recommendation.
if expand_size(mem) < expand_size('40G'):
    print("Insufficent memory for STAR, changing to 40G")
    star_mem = '40G'
else:
    star_mem = mem
input: hg_reference
output: f"{cwd}/STAR_Index/chrName.txt", 
        f"{cwd}/STAR_Index/SAindex", f"{cwd}/STAR_Index/SA", f"{cwd}/STAR_Index/genomeParameters.txt", 
        f"{cwd}/STAR_Index/chrStart.txt",
        f"{cwd}/STAR_Index/chrLength.txt", 
        f"{cwd}/STAR_Index/Genome", f"{cwd}/STAR_Index/chrNameLength.txt"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}', cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    STAR --runMode genomeGenerate \
         --genomeDir ${_output[0]:d} \
         --genomeFastaFiles ${_input[0]} \
         --runThreadN ${numThreads}

[RSEM_index]
parameter: hg_gtf = path
parameter: hg_reference = path
input: hg_reference, hg_gtf
output: f"{cwd}/RSEM_Index/rsem_reference.n2g.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.grp", 
        f"{cwd}/RSEM_Index/rsem_reference.idx.fa", f"{cwd}/RSEM_Index/rsem_reference.ti", 
        f"{cwd}/RSEM_Index/rsem_reference.chrlist", f"{cwd}/RSEM_Index/rsem_reference.seq", 
        f"{cwd}/RSEM_Index/rsem_reference.transcripts.fa"
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output[1]:n}.stderr', stdout = f'{_output[1]:n}.stdout'
    rsem-prepare-reference \
            ${_input[0]} \
            ${_output[1]:n} \
            --gtf ${_input[1]} \
            --num-threads ${numThreads}

[RefFlat_generation]
parameter: hg_gtf = path
input: hg_gtf
output: f'{_input:n}.ref.flat'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bd}'
bash: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    gtfToGenePred ${_input}  ${_output}.tmp -genePredExt -geneNameAsName2
    awk -F'\t' -v OFS="\t" '{$1=$12 OFS $1;}7' ${_output}.tmp | cut -f 1-11 > ${_output}
    rm ${_output}.tmp

[SUPPA_annotation_1]
parameter: hg_gtf = path
input: hg_gtf
output: f'{cwd}/hg38.{_input:bn}_SE_strict.ioe' # The stderr file must not shared the same start with the output file
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
bash: container=container, expand= "${ }", stderr = f'{cwd}/{_input:bn}.stderr', stdout = f'{cwd}/{_input:bn}.stdout'
    suppa.py generateEvents -i ${_input} -o ${cwd}/hg38.${_input:bn} -f ioe -e SE SS MX RI FL

[SUPPA_annotation_2]
parameter: hg_gtf = path
output: f'{cwd}/{hg_gtf:bn}.SUPPA_annotation.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("psichomics")
    suppa <- parseSuppaAnnotation("${_input:d}", genome="hg38") 
    annot <- prepareAnnotationFromEvents(suppa)
    saveRDS(annot, file=${_output:r})

[psi_hg38_annotation]
parameter: hg_gtf = path
# FIXME: Please document what this file is and where do we get it @xuanhe.
parameter: hgrc_db = path
input: hg_gtf, hgrc_db
output: f'{cwd}/psichomics_hg38_annotation.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
R: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    library("psichomics")
    library("purrr")
    library("dplyr")
    library("tidyr")
    library("data.table")
  
    # load psicomics default annotation, option of hg38 from listSplicingAnnotations()
    annotation <- loadAnnotation("AH63657")

    
    # reduce the demension of annotation file
    annotation <- 
      map(annotation, ~.x%>%
                       tidyr::unnest(cols = `Gene`))
  
    # Create empty colomns for each event for easier mapping
    annotation[["Tandem UTR"]][["SUPPA.Event.ID"]] <- NA
    annotation[["Tandem UTR"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative first exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Alternative last exon"]][["VAST-TOOLS.Event.ID"]] <- NA
    annotation[["Mutually exclusive exon"]][["VAST-TOOLS.Event.ID"]] <- NA
  
    # extract Ensembl ID substring from original SUPPA.ID and VASTTOOL.ID
    annotation <- 
      map(annotation, ~.x%>%
                       mutate(ENSG.SUPPA = substr(`SUPPA.Event.ID`, 1, 15))%>%
                       mutate(ENSG.VAST = substr(`VAST-TOOLS.Event.ID`, 1, 15)))
  
    # Load gtf file
    gtf_sample <- read.table('${_input[0]}',header = FALSE, sep = '\t')
  
    # from the gtf file, seperate gene names and corresponding Ensembl ID
    gtf_sample <- separate(gtf_sample, V9, sep = ";",into = c("gene_id", "transcript_id", "exon_number", "gene_name"))
    gtf_sample <- separate(gtf_sample, gene_id, sep = " ",into = c("gene_id", "gene_id_val"))
    gtf_sample <- separate(gtf_sample, gene_name, sep = "e ",into = c("gene_name", "gene_name_val"))
  
    gtf_name_id_match <- gtf_sample[,c("gene_id_val","gene_name_val")]
    gtf_name_id_match <- gtf_name_id_match[!duplicated(gtf_name_id_match), ]
  
    # For any matched approved id in the psi hg38 annotation and gtf file, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.GTF` = gtf_name_id_match$gene_id_val[match(`Gene`, gtf_name_id_match$gene_name_val)]))
  
    # load hgnc database
    hgnc_db <- fread('${_input[1]}', fill = TRUE, header = TRUE, sep = '\t', quote="")
  
    # Combine the `Ensembl.ID.supplied.by.Ensembl.` and `Ensembl.gene.ID` column, if there are any conflict use the former
    # For conflict ones (15 total) both the former and latter records are poiting to the same gene name in Ensembl website so the order should not matter
    hgnc_db <- hgnc_db %>%
    mutate(ENSG.ID = ifelse(`Ensembl ID(supplied by Ensembl)` == "", `Ensembl gene ID`, `Ensembl ID(supplied by Ensembl)`))
  
    # Create a one to one reference list for approved names, previous names and aliases
    # There is no duplicate symbol and Ensembl id info for approved symbol so no need for chromosome verification
    hgnc_name_id_match <- hgnc_db[,c("Approved symbol","ENSG.ID")]
    hgnc_name_prev_check <- hgnc_db[,c("Previous symbols","Chromosome","ENSG.ID")]
    hgnc_name_alias_check <- hgnc_db[,c("Alias symbols","Chromosome","ENSG.ID")]
  
    # Remove NAs
    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$ENSG.ID != "",]
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$ENSG.ID != "",]

    hgnc_name_prev_check <- hgnc_name_prev_check[hgnc_name_prev_check$"Previous symbols" != "",] 
    hgnc_name_alias_check <- hgnc_name_alias_check[hgnc_name_alias_check$"Alias symbols" != "",]

    # Seperate symbol column values from list of sybols to individual rows with one each
    hgnc_name_prev_check <- separate_rows(hgnc_name_prev_check, "Previous symbols", convert = FALSE)
    hgnc_name_alias_check <- separate_rows(hgnc_name_alias_check, "Alias symbols", convert = FALSE)
  
    # Convert chomosome info in hgnc database to number for matching with other database
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_prev_check <- separate(hgnc_name_prev_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_prev_check <- hgnc_name_prev_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_prev_check$Chrp) <= 2, Chrp, Chrq))
    
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'p', into = "Chrp", remove = FALSE)
    hgnc_name_alias_check <- separate(hgnc_name_alias_check, "Chromosome", sep = 'q', into = "Chrq", remove = FALSE)
    hgnc_name_alias_check <- hgnc_name_alias_check%>%
                                mutate(Chr = ifelse(nchar(hgnc_name_alias_check$Chrp) <= 2, Chrp, Chrq))
  
    # For any matched approved id in the psi hg38 annotation and hgnc database, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.HGNC` = hgnc_name_id_match$`ENSG.ID`[match(`Gene`, hgnc_name_id_match$"Approved symbol")]))
  
    # Drop hypothetical genes
    annotation<-
      map(annotation, ~.x%>%
            subset(`Gene` != 'Hypothetical'))
  
    # IN remaining NAs, for any matched alias/previous names and chromosome in the psi hg38 annotation and hgnc database, record the corresponding Ensembl ID
    annotation <-
      map(annotation, ~.x%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_alias_check$`ENSG.ID`[match(`Gene`, hgnc_name_alias_check$"Alias symbols") & match(`Chromosome`, hgnc_name_alias_check$Chr)],
                                                `ENSG.HGNC`))%>%
                      mutate(ENSG.HGNC = ifelse(is.na(`ENSG.HGNC`) | `ENSG.HGNC` == "",
                                                hgnc_name_prev_check$`ENSG.ID`[match(`Gene`, hgnc_name_prev_check$"Previous symbols") & match(`Chromosome`, hgnc_name_prev_check$Chr)],
                                                `ENSG.HGNC`)))
  
    # Build the final Ensembl id column base on the gtf file first, then for remaining NAs check the HGNC database record, SUPPA and VASTTOOL record,
    # Drop special cases that Ensembl ID is not recorded in VASTTOOLs and SUPPA
    # finnally in the remaining NA Ensmbl ID if the gene name in original annotation is NCBI IDs just use it
    annotation <-
      map(annotation, ~.x%>%
                       mutate(`ENSG.ID` = `ENSG.GTF`)%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.HGNC`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.VAST`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`),
                                               `ENSG.SUPPA`,
                                                `ENSG.ID`))%>%
                       mutate(`ENSG.ID` = ifelse(substr(`ENSG.ID`, 1, 4) == 'ENSG',
                                                 `ENSG.ID`,
                                                 NA))%>%
                       mutate(`ENSG.ID` = ifelse(is.na(`ENSG.ID`) & substr(`Gene`, 1, 3) == 'LOC',
                                               `Gene`,
                                                `ENSG.ID`)))

    # Use the Ensembl IDs to replace gene names, drop remaining NAs
    annotation <-
    map(annotation, ~.x%>%
                       mutate(`Gene` = `ENSG.ID`)%>%
            drop_na(`Gene`)
          )

    # save modified annotation
    saveRDS(annotation, file = "${_output}")

[Tad_annotateion]
## The tad file downloads from 
parameter: TAD_list = path
parameter: region_list = path
parameter: TAD_genotype = path
input: TAD_list, region_list
output:f'{cwd:a}/{region_list:bn}.annotation' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output:bn}'
python: container=container, expand= "${ }", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout'
    import pandas as pd
    import numpy as np
    #tad = pd.read_csv("${_input[0]}","\t",header = None)
    #tad.columns = ["#chr","start","end"]
    #tad["tad_index"] = [f'tad{x}' for x in np.arange(1,len(tad)+1)]
    tad = pd.read_csv("${_input[0]}","\t")
    region_list =  pd.read_csv("${_input[1]}","\t")
    tad = tad.merge(region_list, on = "#chr").query("start_y > start_x &  end_y < end_x").drop(["start_y","end_y"] , axis = 1).rename({"start_x":"start","end_x":"end"})
    tad.to_csv("${_output}","\t") 
    tad_geno_list = pd.read_csv("${TAD_genotype}","\t")
    tad_gene_id = tad.merge(tad_geno_list,left_on = "tad_index",right_on = "#id").iloc[:,[4,6]]
    tad_gene_id.columns = ["ID","dir"]
    tad_gene_id.to_csv("${_output}.genotype_files_list.tsv","\t") 

/home/hs3163/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3441: FutureWarning: In a future version of pandas all arguments of read_csv except for the argument 'filepath_or_buffer' will be keyword-only.
  exec(code_obj, self.user_global_ns, self.user_ns)
/home/hs3163/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3441: FutureWarning: In a future version of pandas all arguments of read_csv except for the argument 'filepath_or_buffer' will be keyword-only.
  exec(code_obj, self.user_global_ns, self.user_ns)
/home/hs3163/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py:3441: FutureWarning: In a future version of pandas all arguments of read_csv except for the argument 'filepath_or_buffer' will be keyword-only.
  exec(code_obj, self.user_global_ns, self.user_ns)

Reference Data Standardization

Contents

Reference Data Standardization#

Description#

Command interface#

1. Download reference data#

2. Format reference genome#

3. Transcript and gene model reference processing#

4. Gene annotation#

5. Generate STAR index#

6. Generate RSEM index#

7. Generate RefFlat annotation#

8. Generate SUPPA annotation#

9. Extract rsIDs and TAD annotation#

Output#

Anticipated Results#

Workflow implementation#