APA Calling

APA Calling#

Aim#

The purpose of this notebook is to call APA-based information (PDUI) based on DAPARS2 method.

Methods#

Timing: Runtime varies by dataset size and compute resources. For the toy chr22 MWE dataset, most steps complete in under 10 minutes on a standard HPC node.

%preview ../../images/apa_calling.png

3’UTR Reference#

gtf2bed12.py : Covert gtf to bed format (Source from in-house codes from Li Lab: Xu-Dong/Exon_Intron_Extractor)
DaPars_Extract_Anno.py : extract the 3UTR regions in bed formats from the whole genome bed (Source from Dapars 2: 3UTR/DaPars2)

Call WIG data from transcriptome BAM files#

Using bedtools or rsem-bam2wig, for RSEM based alignment

Config files Generation#

Python 3 loops to read line by line the sum of reads coverage of all chromosome.

Dapars2 Main Function#

Dapars2_Multi_Sample.py: use the least sqaures methods to calculate the usage of long isoforms (3UTR/DaPars2)

Note: this part of code have been modified from source to deal with some formatting discrepancy in wig file

Impute missing values in Dapars result#

KNN using impute R package.

Input#

A list of transcriptome level BAM files, eg generated by RSEM
The 3’UTR annotation reference file

If you do not have 3’UTR annotation file, please generate it first. Input to this step is the transcriptome level gene feature file in GTF format that we previously prepared.

Output#

Dapars config files
PUDI (Raw) information saved in txt
PDUI (Imputed) information saved in txt. This is recommended for further analysis.

Steps#

Step 1. Generate the 3’UTR reference regions from a GTF annotation:

sos run pipeline/apa_calling.ipynb UTR_reference \
    --cwd output/apa \
    --hg-gtf output/apa/chr22.gtf

Step 2. Convert the transcriptome BAM files into per-base coverage (wig) and read-depth (flagstat) files:

sos run pipeline/apa_calling.ipynb bam2tools \
    --cwd output/apa \
    --bam-dir output/rnaseq/bam

Step 3. Generate the DaPars2 sample configuration and mapping files:

sos run pipeline/apa_calling.ipynb APAconfig \
    --cwd output/apa \
    --bfile output/apa/wig \
    --annotation output/apa/chr22_3UTR.bed

Step 4. Run DaPars2 to calculate the PDUI matrix per chromosome:

sos run pipeline/apa_calling.ipynb APAmain \
    --cwd output/apa \
    --chrlist chr22 \
    --dapars-path code/molecular_phenotypes/calling/apa/Dapars2_Multi_Sample.py

Command interface#

sos run pipeline/apa_calling.ipynb -h

Workflow implementation#

[global]
parameter: walltime = '400h'
parameter: mem = '200G'
parameter: ncore = 16
# the output directory for generated files
parameter: cwd = path("output")
# Number of threads
parameter: numThreads = 8
parameter: job_size = 1
parameter: container = ''

Step 0. Generate 3’UTR regions based on GTF#

# Generate the 3UTR region according to the gtf file
[UTR_reference]
# gtf file
parameter: hg_gtf = path
input: hg_gtf
output: f'{cwd}/{_input:bn}.bed', 
        f'{cwd}/{_input:bn}.transcript_to_geneName.txt',
        f'{cwd}/{_input:bn}_3UTR.bed'
bash: expand = '${ }', container = container
    gtf2bed12.py --gtf ${_input} --out ${cwd}
    mv ${cwd}/gene_annotation.bed ${_output[0]}
    mv ${cwd}/transcript_to_geneName.txt ${_output[1]}
    DaPars_Extract_Anno.py -b ${_output[0]} -s ${_output[1]} -o ${_output[2]}

Step 1. Generate WIG coverage and flagstat files from BAM files#

[bam2tools]
parameter: n = 9
n  = [x for x in range(n)]
input: for_each = 'n'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand = True, container = container
    import glob
    import os
    import subprocess
    path = "/mnt/mfs/ctcn/datasets/rosmap/rnaseq/dlpfcTissue/batch{_n}/STAR_aligned"
    name = glob.glob(path + "/**/*Aligned.sortedByCoord.out.bam", recursive = True)
    wigpath = "/home/ls3751/project/ls3751/wig/batch{_n}/"
    if not os.path.exists(wigpath):
        os.makedirs(os.path.dirname(wigpath))
    for i in name:
        id = i.split("/")[-2]
        filedir = path + "/" + id + "/" + id + ".bam"
        out = wigpath + id + ".wig"
        new_cmd = "bedtools genomecov -ibam " + filedir + " -bga -split -trackline" + " > " + out
        os.system(new_cmd)
        out_2 = wigpath + id + ".flagstat"
        new_cmd_2 = f"samtools flagstat --thread {numThreads} " + filedir + " > " + out_2
        os.system(new_cmd_2)

[bam2toolsv1]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, concurrent = True
python: expand = "${ }", container = container
    import os
    import multiprocessing
    import re
    sample_list = [line.strip('\n') for line in open("${sample}")]

    def call_wig_flagstat(subject):
        id = re.findall(r"\D(\d{8})\D",subject)[1]
        prefix = "/mnt/mfs/statgen/ls3751/aqtl_analysis/wig/" + "${tissue}" + "/"
        out = prefix + id + ".wig"
        out_2 = prefix + id + ".flagstat"
        new_cmd = "bedtools genomecov -ibam " + subject + " -bga -split -trackline" + " > " + out
        os.system(new_cmd)
        new_cmd_2 = f"samtools flagstat --thread {numThreads} " + subject + " > " + out_2
        os.system(new_cmd_2)
    
    for sample in sample_list:
        call_wig_flagstat(sample)
    ### process = multiprocessing.Process(target = call_wig_flagstat, args =(sample_list[i]))
    ### process.start()
    ### for p in processes:
    ###     p.join()

[bam2toolsv2]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand = "${ }", container = container
    input="${sample}"
    while IFS=',' read -r col1 col2
    do
      bedtools genomecov -ibam $col1  -bga -split -trackline  > /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/${tissue}/$col2.wig &
      samtools flagstat --thread ${numThreads} $col1  > /mnt/mfs/statgen/ls3751/aqtl_analysis/wig/${tissue}/$col2.flagstat &
    done < "$input"

[bam2toolstmp]
parameter: sample = path
parameter: tissue = path
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = ncore
python: expand = "${ }", container = container
    import os
    import re
    sample_list = [line.strip('\n') for line in open("${sample}")]

    def copy_wigfile(subject):
        id = re.findall(r"\D(\d{8})\D",subject)[0]
        prefix = "/mnt/mfs/statgen/ls3751/aqtl_analysis/wig/" + "${tissue}" + "/"
        file1 = os.path.splitext(subject)[0] + ".flagstat"
        out = prefix + id + ".wig"
        out_2 = prefix + id + ".flagstat"
        new_cmd = "cp " + subject + " " + out
        os.system(new_cmd)
        new_cmd_2 = "cp " + file1 + " " + out_2
        os.system(new_cmd_2)
    
    for sample in sample_list:
        copy_wigfile(sample)

Step 2. Generate the DaPars2 configuration and sample mapping files#

# Generate configuration file
[APAconfig]
parameter: bfile = path
parameter: annotation = path
parameter: job_size = 1
# Default parameters for Dapars2:
parameter: least_pass_coverage_percentage = 0.3
parameter: coverage_threshold = 10
output: [f'{cwd}/sample_mapping_files.txt',f'{cwd}/sample_configuration_file.txt']
task: trunk_workers = 1, trunk_size = 1, walltime = walltime, mem = mem, cores = ncore
python3: expand = "${ }", container = container
    import re
    import os
    target_all_sample = os.listdir("${bfile}")
    target_all_sample = list(filter(lambda v: re.match('.*wig$', v), target_all_sample))
    target_all_sample = ["${bfile}" + "/" + w for w in target_all_sample]
    

    def extract_total_reads(input_flagstat_file):
        num_line = 0
        total_reads = '-1'
        #print input_flagstat_file
        for line in open(input_flagstat_file,'r'):
            num_line += 1
            if num_line == 5:
                total_reads = line.strip().split(' ')[0]
                break
        return total_reads

    #print(target_all_sample)
    print("INFO: Total",len(target_all_sample),"samples found in provided dirctory!")
    # Total depth file:
    mapping_file = open("${_output[0]}", "w")
    for current_sample in target_all_sample:
        flag = current_sample.split(".")[0] + ".flagstat"
        current_sample_total_depth = extract_total_reads(flag)
        field_out = [current_sample, str(current_sample_total_depth)]
        mapping_file.writelines('\t'.join(field_out) + '\n')

        print("Coverage of sample ", current_sample, ": ", current_sample_total_depth)
    mapping_file.close()

    # Configuration file:

    config_file = open(${_output[1]:r},"w")
    config_file.writelines(f"Annotated_3UTR=${annotation}\n")
    config_file.writelines( "Aligned_Wig_files=%s\n" % ",".join(target_all_sample))
    config_file.writelines(f"Output_directory=${cwd}/apa \n")
    config_file.writelines(f"Output_result_file=Dapars_result\n")
    config_file.writelines(f"Least_pass_coverage_percentage=${least_pass_coverage_percentage}\n")
    config_file.writelines( "Coverage_threshold=${coverage_threshold}\n")
    config_file.writelines( "Num_Threads=${numThreads}\n")
    config_file.writelines(f"sequencing_depth_file=${_output[0]}") 
    config_file.close()    

Step 3. Run DaPars2 main to calculate PDUIs#

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

# Call Dapars2 multi_chromosome
[APAmain]
parameter: chr_prefix = False
parameter: chrlist = list
input: for_each = 'chrlist'
output: [f'{cwd}/apa_{x}/Dapars_result_result_temp.{x}.txt' for x in chrlist], group_by = 1
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = ncore
bash: expand = True, container = container
    python2 /mnt/mfs/statgen/ls3751/github/xqtl-protocol/code/Dapars2_Multi_Sample.py {cwd}/sample_configuration_file.txt {_chrlist} {"F" if chr_prefix else "T"}