Gene Coordinate Annotation#

Description#

We use a gene coordinate annotation pipeline based on pyqtl, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct format and converts them to bed format for downstreams analysis.

Alternative implementation#

Previously we use biomaRt package in R instead of code from pyqtl. The core function calls are:

    ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
    ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)

We require ENSEMBL version to be specified explicitly in this pipeline. As of 2021 for the Brain xQTL project, we use ENSEMBL version 103.

Input Files#

File

Description

input/rnaseq/protocol_example.rnaseq.bed.gz

Toy bulk RNA-seq expression matrix (gene-level).

input/proteomics/protocol_example.protein.no_coord.tsv

Toy protein matrix with gene_id|UniProt IDs, no coordinates.

input/rnaseq/protocol_example.rnaseq.gene_ID.tsv

Toy expression matrix with a gene_ID column, for the biomaRt example.

input/rnaseq/protocol_example.leafcutter.intron_count.tsv

Toy leafcutter intron-count table (IDs chr:start:end:clu_N_strand).

input/rnaseq/protocol_example.leafcutter.phenotype.bed.gz

Toy leafcutter intron-excision phenotype matrix.

input/rnaseq/protocol_example.psichomics.phenotype.tsv

Toy psichomics phenotype matrix (IDs end in _<gene_id>).

input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf

Collapsed gene-model GTF (gene/protein annotation).

input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf

Full gene-model GTF with exons (leafcutter/psichomics).

Output Files#

Output

Description

Coordinate-annotated phenotype (*.bed.gz)

The input molecular phenotype matrix with genomic coordinates (#chr start end ID) prepended, ready for downstream QTL analysis.

Cluster-to-gene map (*.leafcutter.clusters_to_genes.txt)

(leafcutter) mapping of each intron cluster to its gene(s).

Phenotype group file (*.phenotype_group.txt)

(leafcutter/psichomics) mapping of each feature ID to its gene, for grouped QTL analysis.

Steps#

The example below adds genomic coordinates to a molecular phenotype matrix using the collapsed gene-model GTF. We show two cases: gene expression and proteins.

Gene Expression#

Timing: <1 min

Annotate a gene-level expression matrix. Each feature ID is an ENSEMBL gene ID that is matched directly against the GTF to obtain its chromosome, start and end coordinates.

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/gene_annotation \
    --phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id

The output protocol_example.rnaseq.bed.gz is written to output/gene_annotation/, with genomic coordinates (#chr start end ID) prepended to the expression matrix.

Proteins#

Timing: <1 min

Annotate a protein matrix. Here each feature ID has the form gene_id\|UniProt; the gene ID portion is matched against the GTF for coordinates and the UniProt portion is retained in the output feature ID. Use --molecular-trait-type protein to select this behaviour.

sos run pipeline/gene_annotation.ipynb annotate_coord \
    --cwd output/gene_annotation \
    --phenoFile input/proteomics/protocol_example.protein.no_coord.tsv \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
    --phenotype-id-column gene_id \
    --molecular-trait-type protein

Leafcutter clusters to genes#

Timing: <2 min

For leafcutter splicing data, intron clusters first need to be assigned to genes. The input is a leafcutter intron-count table whose feature IDs have the form chr:start:end:clu_N_strand; --coordinate-annotation is a full gene-model GTF (with exon records). The default --map-stra site maps introns to genes by matching splice-site coordinates to exon boundaries.

sos run pipeline/gene_annotation.ipynb map_leafcutter_cluster_to_gene \
    --cwd output/gene_annotation \
    --phenoFile input/rnaseq/protocol_example.leafcutter.phenotype.bed.gz \
    --intron-count input/rnaseq/protocol_example.leafcutter.intron_count.tsv \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --map-stra site

Leafcutter isoforms#

Timing: <2 min

This example turns the raw leafcutter intron-excision output into a phenotype matrix that TensorQTL can use. Building on the cluster-to-gene mapping from the previous example, it attaches genomic coordinates and a gene assignment to every intron, then writes a coordinate-annotated phenotype BED together with a phenotype-group file that links each intron back to its gene. The same --intron-count table and --coordinate-annotation GTF used above are supplied here.

Along the way the step produces three intermediate files, shown below with a few example rows from the toy data.

1. Exon list — exon records parsed from the annotation GTF, used to match intron splice sites to genes (<intron_count>.exon_list):

chr

start

end

strand

gene_id

gene_name

1

111869

112227

+

ENSG00000223972

DDX11L1

1

112613

112721

+

ENSG00000223972

DDX11L1

2. Clusters-to-genes — each leafcutter cluster mapped to the gene it belongs to (<intron_count>.leafcutter.clusters_to_genes.txt):

clu

genes

22:clu_10_+

ENSG00000236052

22:clu_11_-

ENSG00000185264

3. Phenotype group — links each annotated intron feature to its gene, so introns from the same gene are grouped together (<phenotype>.phenotype_group.txt):

ID

gene

22:19149095:19149663:clu_14_-:ENSG00000063515

ENSG00000063515

22:19149916:19150025:clu_14_-:ENSG00000063515

ENSG00000063515

sos run pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
    --cwd output/gene_annotation \
    --phenoFile input/rnaseq/protocol_example.leafcutter.phenotype.bed.gz \
    --intron-count input/rnaseq/protocol_example.leafcutter.intron_count.tsv \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
    --map-stra site

Psichomics isoforms#

Timing: <2 min

For psichomics splicing quantifications, each event ID ends in _<gene_id>; the gene ID is matched against the GTF to obtain coordinates. The output is a coordinate-annotated phenotype BED plus a phenotype-group file.

sos run pipeline/gene_annotation.ipynb annotate_psichomics_isoforms \
    --cwd output/gene_annotation \
    --phenoFile input/rnaseq/protocol_example.psichomics.phenotype.tsv \
    --coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf

Alternative coordinate annotation with biomaRt#

Timing: <2 min (depends on network)

Instead of a local GTF, coordinates can be fetched from the Ensembl biomaRt web service. The phenotype matrix must have a gene_ID column of ENSEMBL gene IDs. --ensembl-version selects the Ensembl release; use a release whose server is reachable from your network (e.g. a current release). This step requires internet access to Ensembl.

sos run pipeline/gene_annotation.ipynb annotate_coord_biomart \
    --cwd output/gene_annotation \
    --phenoFile input/rnaseq/protocol_example.rnaseq.gene_ID.tsv \
    --ensembl-version 115

Command Interface#

sos run gene_annotation.ipynb -h
usage: sos run gene_annotation.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  region_list_generation
  annotate_coord
  annotate_coord_biomart
  map_leafcutter_cluster_to_gene
  annotate_leafcutter_isoforms
  annotate_psichomics_isoforms

Global Workflow Options:
  --cwd output (as path)
                        Work directory & output directory
  --phenoFile VAL (as path, required)
                        Molecular phenotype matrix
  --job-size 1 (as int)
                        For cluster jobs, number commands to run per job
  --walltime 5h
                        Wall clock time expected
  --mem 16G
                        Memory expected
  --numThreads 1 (as int)
                        Number of threads
  --container ''

Sections
  region_list_generation:
  annotate_coord:
    Workflow Options:
      --sample-participant-lookup . (as path)
                        A file to map sample ID from expression to genotype,
                        must contain two columns, sample_id and participant_id,
                        mapping IDs in the expression files to IDs in the
                        genotype (these can be the same).
      --molecular-trait-type gene
                        Options: gene, protein, atac
      --coordinate-annotation VAL (as path, required)
                        gtf annotation for RNA-seq data; other types of
                        annotation index for protein and atac-seq
      --phenotype-id-column 'gene_id'
                        gene_id or gene_name for RNA-seq data, SOMAseqID for
                        proteins for example
      --auxiliary-id-mapping . (as path)
                        Optional mapping file (protein_name_index for protein,
                        etc.)
      --[no-]strip-id (default to False)
                        Optional processing flag
  annotate_coord_biomart:
    Workflow Options:
      --ensembl-version VAL (as int, required)
  map_leafcutter_cluster_to_gene:
    Workflow Options:
      --intron-count VAL (as path, required)
                        Extract the code in case psichromatic needs to be
                        processed the same way PheoFile in this step is the
                        intron_count file
      --map-stra site
                        Defines the mapping strategy options: 'site' or
                        'region', with 'site' as the default. The 'site'
                        strategy maps introns to the start and end of each exon.
                        The 'region' strategy, to be recommended in leafcutter2,
                        maps each intron based on it overlapping more than
                        overlap_ratio of a gene's region.
      --overlap-ratio 0.8 (as float)
                        Define the overlap ratio as the proportion of the
                        cluster length that intersects with a gene, used to
                        determine mapping to the gene.
  annotate_leafcutter_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)
  annotate_psichomics_isoforms:
    Workflow Options:
      --sample-participant-lookup . (as path)

Setup and global parameters#

[global]
parameter: modular_script_dir = path('code/script')  # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
parameter: entrypoint= ""

Implementation using pyqtl#

Implementation based on GTEx pipeline.

Following step serves to annotate coord for gene expression file.

[annotate_coord]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
# Options: gene, protein, atac
parameter: molecular_trait_type = "gene" 
# gtf annotation for RNA-seq data; other types of annotation index for protein and atac-seq
parameter: coordinate_annotation = path
# gene_id or gene_name for RNA-seq data, SOMAseqID for proteins for example
parameter: phenotype_id_column = "gene_id"
# Optional mapping file (protein_name_index for protein, etc.)
parameter: auxiliary_id_mapping = path()
# Optional processing flag
parameter: strip_id = False
parameter: sep = "\t"

input: phenoFile, coordinate_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
        --step annotate_coord \
        --phenoFile "${_input[0]}" \
        --coordinate-annotation "${_input[1]}" \
        --sample-participant-lookup "${sample_participant_lookup}" \
        --molecular-trait-type "${molecular_trait_type}" \
        --phenotype-id-column "${phenotype_id_column}" \
        --auxiliary-id-mapping "${auxiliary_id_mapping}" \
        --sep "${sep}" \
        ${"--strip-id" if strip_id else ""} \
        --output-bed "${_output[0]}" \
        --output-region-list "${_output[1]}" \
        --gene-list-output "${cwd:a}/${_input[0]:bn}.gene_list.tsv"

Implementation using biomaRt#

This workflow adds the annotations of chr pos(TSS where start = end -1) and gene_ID to the bed file. This workflow is obsolete.

[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
        f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    Rscript ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.R \
        --step annotate_coord_biomart \
        --phenoFile "${_input[0]}" \
        --ensembl-version ${ensembl_version} \
        --output-bed "${_output[0]}" \
        --output-region-list "${_output[1]}"

Annotation of leafcutter isoform#

The gtf used here should be the collapsed gtf, i.e. the final output of reference_data gtf processing and the one used to called rnaseq.

[map_leafcutter_cluster_to_gene]
## Extract the code in case psichromatic needs to be processed the same way
## PheoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default. 
# The 'site' strategy maps introns to the start and end of each exon. 
# The 'region' strategy, to be recommended in leafcutter2, maps each intron based on it overlapping more than overlap_ratio of a gene's region.
parameter: map_stra = "site" 
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
        --step map_leafcutter_cluster_to_gene \
        --intron-count "${_input[0]}" \
        --annotation-gtf "${_input[1]}" \
        --map-stra "${map_stra}" \
        --overlap-ratio ${overlap_ratio} \
        --output-exon-list "${_output[0]}" \
        --output-cluster-map "${_output[1]}"
[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
        --step annotate_leafcutter_isoforms \
        --phenoFile "${_input[0]}" \
        --annotation-gtf "${_input[1]}" \
        --cluster-map "${_input[3]}" \
        --sample-participant-lookup "${sample_participant_lookup}" \
        --output-bed "${_output[0]}" \
        --output-phenotype-group "${_output[1]}"

Processing of psichomics output#

It occurs that the psichomatic by default grouped the isoforms by gene name, so only thing needs to be done is to extract this information and potentially renamed the gene symbol into ENSG ID

Anticipated Results#

The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.

[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime,  mem = mem, tags = f'{step_name}_{_output[0]:bn}'  
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
    python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
        --step annotate_psichomics_isoforms \
        --phenoFile "${_input[0]}" \
        --annotation-gtf "${_input[1]}" \
        --sample-participant-lookup "${sample_participant_lookup}" \
        --output-bed "${_output[0]}" \
        --output-phenotype-group "${_output[1]}"