Gene Coordinate Annotation

Gene Coordinate Annotation#

Description#

We use a gene coordinate annotation pipeline based on pyqtl, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct format and converts them to bed format for downstreams analysis.

Alternative implementation#

Previously we use biomaRt package in R instead of code from pyqtl. The core function calls are:

    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)

We require ENSEMBL version to be specified explicitly in this pipeline. As of 2021 for the Brain xQTL project, we use ENSEMBL version 103.

Input#

Molecular phenotype data in gct format, with the first column being ENSEMBL ID and other columns being sample names.
GTF for collapsed gene model
- the gene names must be consistent with the molecular phenotype data matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work)
(Optional) Meta-data to match between sample names in expression data and genotype files
- Tab delimited with header
- Only 2 columns: first column is sample name in expression data, 2nd column is sample name in genotype data
- must contains all the sample name in expression matrices even if they don’t existing in genotype data

Output#

Molecular phenotype data in bed format.

Minimal Working Example Steps#

The MWE is uploaded to Synapse

i. Cooridnate Annotation#

a. Gene Expression#

Timing: <1 min

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/rnaseq \
    --phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
    --coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id
    

INFO: Running annotate_coord: 
INFO: annotate_coord is completed.
INFO: annotate_coord output:   /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.region_list.txt
INFO: Workflow annotate_coord (ID=wc0f3b36281bafaba) is executed successfully with 1 completed step.
[fgrennjr@scc-ym2 toy_xqtl_protocol]$ cd output/rnaseq/

b. Proteins#

Timing: <1 min

!sos run gene_annotation.ipynb annotate_coord \
    --cwd ../../../output_test/phenotype_by_chrom \
    --phenoFile ../../../mwe_data/protocol_data/input/xqtl_association/protocol_example.protein.csv \
    --coordinate-annotation ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --sample-participant-lookup ../../../mwe_data/protocol_data/output/protocol_example.protein.sample_overlap.txt \
    --phenotype-id-column gene_name \
    --molecular-trait-type protein \
    --sep ","

INFO: Running annotate_coord_protein: 
INFO: tbecf36ea3e901ea1 re-execute completed
INFO: tbecf36ea3e901ea1 submitted to neurology with job id Your job 3961996 ("job_tbecf36ea3e901ea1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_protein output:   /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.region_list.txt
INFO: Workflow annotate_coord_protein (ID=w75d21004c6718efd) is executed successfully with 1 completed step and 1 completed task.

Troubleshooting#

Step	Substep	Problem	Possible Reason	Solution

Command Interface#

!sos run gene_annotation.ipynb -h

usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  region_list_generation
  annotate_coord
  annotate_coord_biomart
  map_leafcutter_cluster_to_gene
  annotate_leafcutter_isoforms
  annotate_psichomics_isoforms

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads
  --container ''
  --entrypoint ''

Sections
  region_list_generation:
  annotate_coord:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
      --molecular-trait-type gene
                        Options: gene, protein, atac
      --coordinate-annotation VAL (as path, required)
                        gtf annotation for RNA-seq data; other types of
                        annotation index for protein and atac-seq
      --phenotype-id-column 'gene_id'
                        gene_id or gene_name for RNA-seq data, SOMAseqID for
                        proteins for example
      --auxiliary-id-mapping . (as path)
                        Optional mapping file (protein_name_index for protein,
                        etc.)
      --[no-]strip-id (default to False)
                        Optional processing flag
  annotate_coord_biomart:
    Workflow Options:
      --ensembl-version VAL (as int, required)
  map_leafcutter_cluster_to_gene:
    Workflow Options:
      --intron-count VAL (as path, required)
                        Extract the code in case psichromatic needs to be
                        processed the same way PheoFile in this step is the
                        intron_count file
      --map-stra site
                        Defines the mapping strategy options: 'site' or
                        'region', with 'site' as the default. The 'site'
                        strategy maps introns to the start and end of each exon.
                        The 'region' strategy, to be recommended in leafcutter2,
                        maps each intron based on it overlapping more than
                        overlap_ratio of a gene's region.
      --overlap-ratio 0.8 (as float)
                        Define the overlap ratio as the proportion of the
                        cluster length that intersects with a gene, used to
                        determine mapping to the gene.
  annotate_leafcutter_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)
  annotate_psichomics_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)

Setup and global parameters#

[global]
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
parameter: entrypoint= ""

Annotate coord#

Input:#

A gtf file used to generated the bed
A phenotype bed file, must have a gene_id column indicating the name of genes.

Output:#

annotated phenotype bed file
Optional: a tsv file only has the first several columns indicating phenotype_id and names. For RNA-seq, it contains five columns: chr,start,end,gene_id,gene_name

Implementation using `pyqtl`#

Implementation based on GTEx pipeline.

Following step serves to annotate coord for gene expression file.

[annotate_coord]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
# Options: gene, protein, atac
parameter: molecular_trait_type = "gene" 
# gtf annotation for RNA-seq data; other types of annotation index for protein and atac-seq
parameter: coordinate_annotation = path
# gene_id or gene_name for RNA-seq data, SOMAseqID for proteins for example
parameter: phenotype_id_column = "gene_id"
# Optional mapping file (protein_name_index for protein, etc.)
parameter: auxiliary_id_mapping = path()
# Optional processing flag
parameter: strip_id = False
parameter: sep = "\t"

input: phenoFile, coordinate_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'

    import pandas as pd
    import numpy as np
    import qtl.io
    from pathlib import Path
    from collections import defaultdict
    import warnings

    # Function to convert gtf to bed
    def gtf_to_bed(annotation_gtf, feature='gene', exclude_chrs=[], phenotype_id='gene_id'):
        """
        Parse genes from GTF and return DataFrame with min start and max end positions.
        For each gene, uses the smallest position as start and largest as end position.
        """
        
        # Data collections
        gene_data = defaultdict(lambda: {'chr': '', 'start': float('inf'), 'end': 0, 'strand': ''})
        gene_names = {}  # To store gene_name for each gene_id
        
        if annotation_gtf.endswith('.gz'):
            opener = gzip.open(annotation_gtf, 'rt')
        else:
            opener = open(annotation_gtf, 'r')
            
        with opener as gtf:
            for row in gtf:
                row = row.strip().split('\t')
                if row[0][0] == '#' or row[2] != feature: 
                    continue  # Skip header or non-matching features
                
                # Parse attributes
                attributes = defaultdict()
                for a in row[8].replace('"', '').split(';')[:-1]:
                    kv = a.strip().split(' ')
                    if kv[0] != 'tag':
                        attributes[kv[0]] = kv[1]
                    else:
                        attributes.setdefault('tags', []).append(kv[1])
                
                # Get gene identifiers
                curr_gene_id = attributes['gene_id']
                curr_gene_name = attributes['gene_name']
                gene_names[curr_gene_id] = curr_gene_name
                
                # Update gene data
                data = gene_data[curr_gene_id]
                if data['chr'] == '':  # First entry for this gene
                    data['chr'] = row[0]
                    data['strand'] = row[6]
                
                # Update min start and max end
                start_pos = int(row[3]) - 1  # Convert to 0-based for BED
                end_pos = int(row[4]) - 1  # Both start and end are used
                data['start'] = min(data['start'], start_pos)
                data['end'] = max(data['end'], end_pos)
        
        # Convert to dataframe format
        chrom, start, end, ids, names, strand = [], [], [], [], [], []
        for gene_id, data in gene_data.items():
            chrom.append(data['chr'])
            start.append(data['start'])
            end.append(data['end'])
            ids.append(gene_id)
            names.append(gene_names[gene_id])
            strand.append(data['strand'])
        
        # Create DataFrame based on phenotype_id
        if phenotype_id == 'gene_id':
            bed_df = pd.DataFrame({
                'chr': chrom,
                'start': start,
                'end': end,
                'gene_id': ids,
                'strand': strand
            }, columns=['chr', 'start', 'end', 'gene_id', 'strand'], index=ids)
        elif phenotype_id == 'gene_name':
            bed_df = pd.DataFrame({
                'chr': chrom,
                'start': start,
                'end': end,
                'gene_id': names,
                'strand': strand
            }, columns=['chr', 'start', 'end', 'gene_id', 'strand'], index=names)
        
        # Filter out excluded chromosomes
        mask = np.ones(len(chrom), dtype=bool)
        for k in exclude_chrs:
            mask = mask & (bed_df['chr'] != k)
        bed_df = bed_df[mask]
        
        # Sort by chromosome and start position
        bed_df = bed_df.groupby('chr', sort=False, group_keys=False).apply(lambda x: x.sort_values(['start', 'end']))
        
        return bed_df

    def prepare_bed(df, bed_template_df, chr_subset=None):
        bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
        bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values(['start','end']))
        if chr_subset is not None:
            bed_df = bed_df[bed_df.chr.isin(chr_subset)]
        return bed_df

    def load_and_preprocess_data(input_path, drop_columns):
        df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
        dc = [col for col in df.columns if col in drop_columns] # Take intersection between df.columns and drop_columns
        df = df.drop(dc, axis=1) # drop the intersect
        if len(df.columns) < 2:
            raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
        return df

    def rename_samples_using_lookup(df, lookup_path):
        lookup_file = Path(lookup_path)
        
        if not lookup_file.is_file():
            return df
            
        if lookup_file.suffix == '.csv':
            sep = ','
        elif lookup_file.suffix == '.tsv':
            sep = '\t'
        else:
            sep = '\t'  # Default to tab if unknown extension
            
        try:
            sample_participant_lookup = pd.read_csv(lookup_file, sep=sep, dtype={0:str, 1:str})
            if "genotype_id" in sample_participant_lookup.columns:
                sample_participant_lookup = sample_participant_lookup.set_index(sample_participant_lookup.columns[1])
                df.rename(columns=sample_participant_lookup.to_dict()["genotype_id"], inplace=True)
            else:
                # Fall back to the no header style
                sample_participant_lookup = pd.read_csv(lookup_file, sep=sep, header=None, dtype={0:str, 1:str})
                rename_dict = dict(zip(sample_participant_lookup[0], sample_participant_lookup[1]))
                df.rename(columns=rename_dict, inplace=True)
        except Exception as e:
            print(f"Warning: Error processing sample lookup file: {e}")
        return df

    def load_bed_template_from_gtf(input_path, phenotype_id_column):
        if sum(gtf_to_bed(input_path, feature='gene', phenotype_id="gene_id").index.duplicated()) > 0:
            raise ValueError(f"gtf file {input_path} needs to be collapsed into gene model by reference data processing module")

        bed_template_df_id = gtf_to_bed(input_path, feature='transcript', phenotype_id="gene_id")
        bed_template_df_name = gtf_to_bed(input_path, feature='transcript', phenotype_id="gene_name")
        bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end", "strand"])
        bed_template_df.columns = ["#chr", "start", "end", "gene_id", "strand", "gene_name"]
        bed_template_df = bed_template_df.set_index(phenotype_id_column, drop=False)
        return bed_template_df

    drop_cols = ["#chr", "chr", "start", "end", "stop", "annot.seqnames", "annot.start", "annot.end"]
    df = load_and_preprocess_data(${_input[0]:ar}, drop_cols)
    molecular_trait_type = "${molecular_trait_type}"
    df.set_index(df.columns[0], inplace=True)
    if ${True if sample_participant_lookup.is_file() else False}:
        df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")    
    # First column is always the phenotype ID column
    phenotype_id = df.columns.values[0]
    
    # Type-specific processing
    if molecular_trait_type == "gene":
        if ${strip_id}:
            # Update the index by stripping the pattern "A | B" to "B"
            df.index = df.index.map(lambda x: x.split('|')[-1].strip() if '|' in x else x)
            
        bed_template_df = load_bed_template_from_gtf(${_input[1]:ar}, "${phenotype_id_column}")
        bed_df = prepare_bed(df, bed_template_df).drop("gene_name", axis=1)
        bed_df = bed_df.drop_duplicates("gene_id", keep=False)
        bed_df = bed_df.rename(columns={'gene_id': 'ID'})
        
    elif molecular_trait_type == "protein":
        auxiliary_mapping = Path("${auxiliary_id_mapping:a}")
        
        if auxiliary_mapping.is_file():
            # Use protein name mapping from auxiliary file
            df_info = pd.read_csv(auxiliary_mapping).rename(
                columns={phenotype_id: phenotype_id, 'EntrezGeneSymbol': 'gene_name'}
            )[['gene_name', phenotype_id, 'UniProt']]
            df = df_info.merge(df, on=phenotype_id).drop(phenotype_id, axis=1)
        else:
            # Extract UniProt ID from phenotype ID if no mapping file
            df[[phenotype_id, 'UniProt']] = df[phenotype_id].astype(str).str.split('|', 1, expand=True)
            
        bed_template_df = load_bed_template_from_gtf(${_input[1]:ar}, "${phenotype_id_column}")
        bed_df = prepare_bed(df, bed_template_df)
        bed_df["ID"] = bed_df["gene_id"] + "_" + bed_df["UniProt"]
        bed_df = bed_df.drop_duplicates("ID", keep=False)[["#chr", "start", "end", "ID"] + df.drop(["UniProt"], axis=1).columns.values.tolist()]
        
    elif molecular_trait_type == "atac":
        # For ATAC, the coordinate_annotation is directly a BED file
        bed_template_df = pd.read_csv("${_input[1]}", sep="\t").set_index("ID")
        bed_template_df = bed_template_df.assign(ID=bed_template_df.index)
        bed_df = prepare_bed(df, bed_template_df)
        bed_df = bed_df.drop_duplicates("ID", keep=False)
    else:
        raise ValueError(f"Unsupported molecular_trait_type: {molecular_trait_type}")
        
    qtl.io.write_bed(bed_df, ${_output[0]:r})
    bed_df[["#chr", "start", "end", "ID", "strand"]].assign(path=${_output[0]:r}).to_csv(${_output[1]:r}, "\t", index=False)

    # If annotating gene or protein, save a gene_list.tsv for gene partitioning
    if molecular_trait_type in ["gene", "protein"]:
        # Save the region list
        region_list = bed_template_df[bed_template_df.${phenotype_id_column}.isin(df.index)]
        region_list.to_csv('${cwd:a}/${_input[0]:bn}.gene_list.tsv', sep="\t", index=False)
    else:
        # For ATAC, we cannot partition it by gene.
        warnings.warn(f"Gene partitioning is not applicable for molecular trait type: {molecular_trait_type}. No gene_list.tsv will be generated.")
        


bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

Implementation using biomaRt#

This workflow adds the annotations of chr pos(TSS where start = end -1) and gene_ID to the bed file. This workflow is obsolete.

[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
        f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
R:  expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container, entrypoint=entrypoint

    library(biomaRt)
    library(dplyr)
    library(readr)

    # Clear biomart cache
    biomartCacheClear()

    # Read gene expression data
    gene_exp <- read_delim("$[_input[0]]", delim = "\t")

    # If column "#chr" exists, remove the first 3 columns (chr, start, and end)
    if("#chr" %in% colnames(gene_exp)){
      gene_exp <- gene_exp[, 4:ncol(gene_exp)]
    }

    # Connect to Ensembl biomart
    ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")

    # Get gene annotations from Ensembl
    ensembl_df <- getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"), mart = ensembl)

    # Filter and rename columns
    my_genes_ann <- ensembl_df %>%
      filter(ensembl_gene_id %in% gene_exp$gene_ID, chromosome_name %in% 1:23) %>%
      rename("#chr" = chromosome_name, 
            "start" = start_position, 
            "end" = end_position, 
            "gene_ID" = ensembl_gene_id) %>%
      filter(gene_ID != "NA")

    # Save the annotations
    my_genes_ann %>%
      select(`#chr`, start, end, gene_ID) %>%
      write_delim(path = "$[_output[1]]", "\t")

    # Merge the annotations with the gene expression data, after modifying 'end' to be 'start + 1'
    my_gene_bed <- my_genes_ann %>%
      mutate(end = start + 1) %>%
      inner_join(gene_exp, by = "gene_ID") %>%
      arrange(`#chr`, start)

    # Save the final merged data
    write_tsv(my_gene_bed, path = "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")


bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint=entrypoint
    bgzip -f $[_output[0]:n]
    tabix -p bed $[_output[0]] -f

Annotation of leafcutter isoform#

The following steps processed the output files of leafcutter so that they are TensorQTL ready. Shown below are three intemediate files

Exon list

chr	start	end	strand	gene_id	gene_name
chr1	29554	30039	+	ENSG00000243485	MIR1302-2HG
chr1	30564	30667	+	ENSG00000243485	MIR1302-2HG
chr1	30976	31097	+	ENSG00000243485	MIR1302-2HG
chr1	35721	36081	-	ENSG00000237613	FAM138A
chr1	35277	35481	-	ENSG00000237613	FAM138A
chr1	34554	35174	-	ENSG00000237613	FAM138A
chr1	65419	65433	+	ENSG00000186092	OR4F5
chr1	65520	65573	+	ENSG00000186092	OR4F5
chr1	69037	71585	+	ENSG00000186092	OR4F5

clusters_to_genes

clu	genes
1:clu_1_+	ENSG00000116288
1:clu_10_+	ENSG00000143774
1:clu_11_+	ENSG00000143774
1:clu_12_+	ENSG00000143774
1:clu_14_-	ENSG00000126709
1:clu_15_-	ENSG00000121753
1:clu_16_-	ENSG00000121753
1:clu_17_-	ENSG00000116560
1:clu_18_-	ENSG00000143549

phenotype_group

X1	X2
7:102476270:102478811:clu_309_-:ENSG00000005075	ENSG00000005075
7:102476270:102478808:clu_309_-:ENSG00000005075	ENSG00000005075
X:47572961:47574002:clu_349_-:ENSG00000008056	ENSG00000008056
X:47572999:47574002:clu_349_-:ENSG00000008056	ENSG00000008056
8:27236905:27239971:clu_322_-:ENSG00000015592	ENSG00000015592
8:27239279:27239971:clu_322_-:ENSG00000015592	ENSG00000015592
8:27241262:27241677:clu_323_-:ENSG00000015592	ENSG00000015592
8:27241262:27242397:clu_323_-:ENSG00000015592	ENSG00000015592
8:27241757:27242397:clu_323_-:ENSG00000015592	ENSG00000015592
1:35558223:35559107:clu_4_+:ENSG00000020129	ENSG00000020129

The gtf used here should be the collapsed gtf, i.e. the final output of reference_data gtf processing and the one used to called rnaseq.

[map_leafcutter_cluster_to_gene]
## Extract the code in case psichromatic needs to be processed the same way
## PheoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default. 
# The 'site' strategy maps introns to the start and end of each exon. 
# The 'region' strategy, to be recommended in leafcutter2, maps each intron based on it overlapping more than overlap_ratio of a gene's region.
parameter: map_stra = "site" 
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import qtl.annotation

    gtf = ${_input[1]:r}
    #there is no "gene_type" when using the uncollapsed gtf. Replace "gene_biotype" with "gene_type" in the gtf and temporarily save for use in the annotation
    has_gene_type = True
    with open(gtf, 'r') as file:
        for line in file:
            if line[0] == '#':
                continue

            row = line.strip().split('\t')
            chrom = row[0]
            # source = row[1]
            annot_type = row[2]

            if annot_type == "gene":
                if "gene_type" not in line:
                    has_gene_type = False
                break

    if has_gene_type == False:
        
        with open(gtf, 'r') as file:
            file_contents = file.read()
        updated_contents = file_contents.replace("gene_biotype", "gene_type")
        gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
        with open(gtf, 'w') as file:
            file.write(updated_contents)


    # Load data
    #annot = qtl.annotation.Annotation(${_input[1]:r})
    annot = qtl.annotation.Annotation(gtf)
    exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
                        for g in annot.genes for e in g.transcripts[0].exons],
                       columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
    exon_df.to_csv(${_output[0]:r}, sep='\t', index=False)
    
    
R:expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    suppressMessages(library(dplyr, quietly=TRUE))
    suppressMessages(library(stringr, quietly=TRUE))
    suppressMessages(library(foreach, quietly=TRUE))
    # leafcutter functions:
    
    #' Make a data.frame of meta data about the introns
    #' @param introns Names of the introns
    #' @return Data.frame with chr, start, end, cluster id
    #' @export
    get_intron_meta <- function(introns) {
      intron_meta <- do.call(rbind, strsplit(introns, ":"))  
      if (ncol(intron_meta) == 5) {
        colnames(intron_meta) <- c("chr", "start", "end", "clu", "category")
      } else {
        colnames(intron_meta) <- c("chr", "start", "end", "clu")
      }

      intron_meta <- as.data.frame(intron_meta, stringsAsFactors = FALSE)
      intron_meta$start <- as.numeric(intron_meta$start)
      intron_meta$end <- as.numeric(intron_meta$end)

      return(intron_meta)
    }
    #' Work out which gene each cluster belongs to. Note the chromosome names used in the two inputs must match.
    #' @param intron_meta Data frame describing the introns, usually from get_intron_meta
    #' @param exons_table Table of exons, see e.g. /data/gencode19_exons.txt.gz
    #' @return Data.frame with cluster ids and genes separated by commas
    #' @import dplyr
    #' @export
    map_clusters_to_genes_site <- function(intron_meta, exons_table) {
      suppressMessages(library(foreach, quietly=TRUE))
      gene_df <- foreach (chr=sort(unique(intron_meta$chr)), .combine=rbind) %dopar% {
    
        intron_chr <- intron_meta[ intron_meta$chr==chr, ]
        exons_chr <- exons_table[exons_table$chr==chr, ]
    
        exons_chr$temp <- exons_chr$start
        intron_chr$temp <- intron_chr$end
        three_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        exons_chr$temp <- exons_chr$end
        intron_chr$temp <- intron_chr$start
        five_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
    
        all_matches <- rbind(three_prime_matches, five_prime_matches)[ , c("clu", "gene_name")]
    
        all_matches <- all_matches[!duplicated(all_matches),]
    
        if (nrow(all_matches)==0) return(NULL)
        all_matches$clu <- paste(chr,all_matches$clu,sep=':')
        all_matches
      }
    
      clu_df <- gene_df %>% group_by(clu) %>% summarize(genes=paste(gene_name, collapse = ","))
      class(clu_df) <- "data.frame"
      clu_df
    }
    
    map_clusters_to_genes_region <- function(intron_meta, exons_table, f) {
        #combine the exon position to get gene region
        merged_df <- exons_table %>%
            group_by(chr, gene_id) %>%
            summarize(start = min(start), end = max(end), .groups = 'drop') %>% as.data.frame()%>%
            .[,c("chr","start","end","gene_id")]
        # use bedtools to map the intron region to gene region 
         options(bedtools.path = "/home/rf2872/software/bedtools2/bin/") #Fixme: build bedtools in leafcutter sif
                map_res <- bedtoolsr::bt.intersect(a = intron_meta, b = merged_df,f = f, wa = TRUE, wb=TRUE)
        #organize output of mapped file
        exon_map <- map_res %>% 
            mutate(clu = paste(V1, V4, sep=":"), 
                   genes = .[[paste0("V", 4 + ncol(intron_meta))]]) %>%
                select(clu,genes)
          return(exon_map)
    }

    cat("LeafCutter: mapping clusters to genes\n")
    intron_counts <- read.table(${_input[0]:r}, header=TRUE, check.names=FALSE, row.names=4)
    intron_meta <- get_intron_meta(rownames(intron_counts))
    exon_table <- read.table(${_output[0]:r}, header=TRUE, stringsAsFactors=FALSE)
    if(!str_detect(intron_meta$chr[1],"chr")) {
        exon_table = exon_table %>% mutate(chr = str_remove_all(chr,"chr"))
    } else if (!any(str_detect(exon_table$chr[1],"chr"))) {
        exon_table = exon_table %>% mutate(chr = paste0("chr",chr))
    } else {exon_table = exon_table}
    stopifnot(is.element('gene_id', colnames(exon_table)))
    exon_table[, 'gene_name'] <- exon_table[, 'gene_id']
    if("${map_stra}" == "site"){
      cat("mapping clusters to genes by splicing site\n")
      m <- map_clusters_to_genes_site(intron_meta, exon_table)
      } else if("${map_stra}" == "region"){
      cat("mapping clusters to genes by overlapping gene region\n")
      m <- map_clusters_to_genes_region(intron_meta, exon_table, f = ${overlap_ratio})
    } else {
      stop("Map strategy should only be 'site' or 'region'.")
    }
    write.table(m, ${_output[1]:r}, sep = "\t", quote=FALSE, row.names=FALSE)
  
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import numpy as np
    import qtl.io
    import os
    from pathlib import Path

    # Load data
    if os.path.exists(${_input[1]:rn}.tmp${_input[1]:rx}):
        gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
    else:
        gtf = ${_input[1]:r}
    #tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r})
    tss_df = qtl.io.gtf_to_tss_bed(gtf)
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df.columns.values[0] = "#chr" # Temporary
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    cluster2gene_dict = pd.read_csv(${_input[3]:r}, sep='\t', index_col=0).to_dict()
    cluster2gene_dict = cluster2gene_dict['genes']
    print('    ** assigning introns to gene mapping(s)')
    n = 0
    gene_bed_df = []
    group_s = {}
    for _,r in bed_df.iterrows():
        s = r['ID'].split(':')
        cluster_id = s[0]+':'+s[3]
        if cluster_id in cluster2gene_dict:
            gene_ids = cluster2gene_dict[cluster_id].split(',')
            for g in gene_ids:
                gi = r['ID']+':'+g
                gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
                group_s[gi] = g
        else:
            n += 1
    if n > 0:
        print(f'    ** discarded {n} introns without a gene mapping')

    print('  * writing BED files for QTL mapping')
    gene_bed_df = pd.DataFrame(gene_bed_df, columns=bed_df.columns)
    # sort by TSS
    gene_bed_df = gene_bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
    #rename the samples if they named by file name (simply pick the first element with [.])
    gene_bed_df.columns = list(gene_bed_df.columns[:4]) + [name.split('.')[0] if 'junc' in name else name for name in gene_bed_df.columns[4:]]
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
        #gene_bed_df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
        # Create a dictionary mapping from sample_id to participant_id
        column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
        # Get the column names to be replaced
        column_names = gene_bed_df.columns[4:]
        # Replace the column names using the mapping dictionary
        #new_column_names = [column_mapping.get(col, 'missing_data') for col in column_names]
        #it should overlap with genotype in downstream anyways
        new_column_names = [column_mapping.get(col, col) for col in column_names]
        gene_bed_df.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)

    gene_bed_df = gene_bed_df.drop_duplicates()
    qtl.io.write_bed(gene_bed_df, ${_output[0]:r})
    gene_bed_df[['start', 'end']] = gene_bed_df[['start', 'end']].astype(np.int32)
    gene_bed_df[gene_bed_df.columns[4:]] = gene_bed_df[gene_bed_df.columns[4:]].astype(np.float32)
    group_s_df =  pd.Series(group_s).sort_values().reset_index()
    group_s_df.columns = ['ID', 'gene']
    group_s_df.to_csv(${_output[1]:r}, sep='\t', index=False, header=True)

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout

        rm $[_input[1]:rn].tmp$[_input[1]:rx]

        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done

Processing of psichomics output#

It occurs that the psichomatic by default grouped the isoforms by gene name, so only thing needs to be done is to extract this information and potentially renamed the gene symbol into ENSG ID

[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    import pandas as pd
    import numpy as np
    import qtl.io
    from pathlib import Path
    tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r}, feature='gene',phenotype_id = "gene_id" )
    bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
    bed_df["gene_id"]  = [x[-1] for x in bed_df.ID.str.split("_")]
    sample_participant_lookup = Path("${sample_participant_lookup:a}")
    if "start" in  bed_df.columns:
        bed_df = bed_df.drop(["#Chr","start","end"],axis = 1)
    output = tss_df.merge(bed_df, left_on = ["gene_id"], right_on = ["gene_id"],how = "right").sort_values(["chr","start"])
    # change sample IDs to participant IDs
    if sample_participant_lookup.is_file():
        sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})

        column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))

        column_names = output.columns[4:]
        print(column_names)
        new_column_names = [column_mapping.get(col, col) for col in column_names]
        output.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)

        #output.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)

    # Old code grouping by each gene
    #bed_output = output.drop("gene_id" , axis = 1)
    #phenotype_group = output[["ID","gene_id"]]

    bed_output = output.drop("gene_id" , axis = 1)

    # R version 
    #output = output%>%mutate(gene_type_id = map_chr(str_split( ID ,"_"),~paste0(.x[length(.x)],"_",.x[1] )) )
    
    # corrected in python
    output = output.assign(gene_type_id = output['ID'].str.split('_').map(lambda x: x[-1] + '_' + x[0]))
    phenotype_group = output[["ID","gene_type_id"]]

    bed_output.to_csv(${_output[0]:nr},"\t",index = False)
    phenotype_group.to_csv(${_output[1]:r},"\t",index = False,header=False)

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
    bgzip -f ${_output[0]:n}
    tabix ${_output[0]}

bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
        stdout=$[_output[0]:n].stdout
        for i in $[_output[0]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `zcat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `zcat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        zcat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done
        for i in $[_output[1]] ; do 
        echo "output_info: $i " >> $stdout;
        echo "output_size:" `ls -lh $i | cut -f 5  -d  " "`   >> $stdout;
        echo "output_rows:" `cat $i | wc -l  | cut -f 1 -d " "`   >> $stdout;
        echo "output_headerow:" `cat $i | grep "##" | wc -l `   >> $stdout;
        echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w `   >> $stdout;
        echo "output_preview:"   >> $stdout;
        cat $i  | grep -v "##" | head  | cut -f 1,2,3,4,5,6   >> $stdout ; done