Gene Coordinate Annotation#
Description#
We use a gene coordinate annotation pipeline based on pyqtl, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct format and converts them to bed format for downstreams analysis.
Alternative implementation#
Previously we use biomaRt package in R instead of code from pyqtl. The core function calls are:
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
We require ENSEMBL version to be specified explicitly in this pipeline. As of 2021 for the Brain xQTL project, we use ENSEMBL version 103.
Input Files#
File |
Description |
|---|---|
|
Toy bulk RNA-seq expression matrix (gene-level). |
|
Toy protein matrix with |
|
Toy expression matrix with a |
|
Toy leafcutter intron-count table (IDs |
|
Toy leafcutter intron-excision phenotype matrix. |
|
Toy psichomics phenotype matrix (IDs end in |
|
Collapsed gene-model GTF (gene/protein annotation). |
|
Full gene-model GTF with exons (leafcutter/psichomics). |
Output Files#
Output |
Description |
|---|---|
Coordinate-annotated phenotype ( |
The input molecular phenotype matrix with genomic coordinates ( |
Cluster-to-gene map ( |
(leafcutter) mapping of each intron cluster to its gene(s). |
Phenotype group file ( |
(leafcutter/psichomics) mapping of each feature ID to its gene, for grouped QTL analysis. |
Steps#
The example below adds genomic coordinates to a molecular phenotype matrix using the collapsed gene-model GTF. We show two cases: gene expression and proteins.
Gene Expression#
Timing: <1 min
Annotate a gene-level expression matrix. Each feature ID is an ENSEMBL gene ID that is matched directly against the GTF to obtain its chromosome, start and end coordinates.
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/gene_annotation \
--phenoFile input/rnaseq/protocol_example.rnaseq.bed.gz \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id
The output protocol_example.rnaseq.bed.gz is written to output/gene_annotation/, with genomic coordinates (#chr start end ID) prepended to the expression matrix.
Proteins#
Timing: <1 min
Annotate a protein matrix. Here each feature ID has the form gene_id\|UniProt; the gene ID portion is matched against the GTF for coordinates and the UniProt portion is retained in the output feature ID. Use --molecular-trait-type protein to select this behaviour.
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/gene_annotation \
--phenoFile input/proteomics/protocol_example.protein.no_coord.tsv \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id \
--molecular-trait-type protein
Leafcutter clusters to genes#
Timing: <2 min
For leafcutter splicing data, intron clusters first need to be assigned to genes. The input is a leafcutter intron-count table whose feature IDs have the form chr:start:end:clu_N_strand; --coordinate-annotation is a full gene-model GTF (with exon records). The default --map-stra site maps introns to genes by matching splice-site coordinates to exon boundaries.
sos run pipeline/gene_annotation.ipynb map_leafcutter_cluster_to_gene \
--cwd output/gene_annotation \
--phenoFile input/rnaseq/protocol_example.leafcutter.phenotype.bed.gz \
--intron-count input/rnaseq/protocol_example.leafcutter.intron_count.tsv \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--map-stra site
Leafcutter isoforms#
Timing: <2 min
This example turns the raw leafcutter intron-excision output into a phenotype matrix that TensorQTL can use. Building on the cluster-to-gene mapping from the previous example, it attaches genomic coordinates and a gene assignment to every intron, then writes a coordinate-annotated phenotype BED together with a phenotype-group file that links each intron back to its gene. The same --intron-count table and --coordinate-annotation GTF used above are supplied here.
Along the way the step produces three intermediate files, shown below with a few example rows from the toy data.
1. Exon list — exon records parsed from the annotation GTF, used to match intron splice sites to genes (<intron_count>.exon_list):
chr |
start |
end |
strand |
gene_id |
gene_name |
|---|---|---|---|---|---|
1 |
111869 |
112227 |
+ |
ENSG00000223972 |
DDX11L1 |
1 |
112613 |
112721 |
+ |
ENSG00000223972 |
DDX11L1 |
2. Clusters-to-genes — each leafcutter cluster mapped to the gene it belongs to (<intron_count>.leafcutter.clusters_to_genes.txt):
clu |
genes |
|---|---|
22:clu_10_+ |
ENSG00000236052 |
22:clu_11_- |
ENSG00000185264 |
3. Phenotype group — links each annotated intron feature to its gene, so introns from the same gene are grouped together (<phenotype>.phenotype_group.txt):
ID |
gene |
|---|---|
22:19149095:19149663:clu_14_-:ENSG00000063515 |
ENSG00000063515 |
22:19149916:19150025:clu_14_-:ENSG00000063515 |
ENSG00000063515 |
sos run pipeline/gene_annotation.ipynb annotate_leafcutter_isoforms \
--cwd output/gene_annotation \
--phenoFile input/rnaseq/protocol_example.leafcutter.phenotype.bed.gz \
--intron-count input/rnaseq/protocol_example.leafcutter.intron_count.tsv \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf \
--map-stra site
Psichomics isoforms#
Timing: <2 min
For psichomics splicing quantifications, each event ID ends in _<gene_id>; the gene ID is matched against the GTF to obtain coordinates. The output is a coordinate-annotated phenotype BED plus a phenotype-group file.
sos run pipeline/gene_annotation.ipynb annotate_psichomics_isoforms \
--cwd output/gene_annotation \
--phenoFile input/rnaseq/protocol_example.psichomics.phenotype.tsv \
--coordinate-annotation input/reference_data/Homo_sapiens.GRCh38.103.chr.gtf
Alternative coordinate annotation with biomaRt#
Timing: <2 min (depends on network)
Instead of a local GTF, coordinates can be fetched from the Ensembl biomaRt web service. The phenotype matrix must have a gene_ID column of ENSEMBL gene IDs. --ensembl-version selects the Ensembl release; use a release whose server is reachable from your network (e.g. a current release). This step requires internet access to Ensembl.
sos run pipeline/gene_annotation.ipynb annotate_coord_biomart \
--cwd output/gene_annotation \
--phenoFile input/rnaseq/protocol_example.rnaseq.gene_ID.tsv \
--ensembl-version 115
Command Interface#
sos run gene_annotation.ipynb -h
usage: sos run gene_annotation.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
region_list_generation
annotate_coord
annotate_coord_biomart
map_leafcutter_cluster_to_gene
annotate_leafcutter_isoforms
annotate_psichomics_isoforms
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--phenoFile VAL (as path, required)
Molecular phenotype matrix
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 1 (as int)
Number of threads
--container ''
Sections
region_list_generation:
annotate_coord:
Workflow Options:
--sample-participant-lookup . (as path)
A file to map sample ID from expression to genotype,
must contain two columns, sample_id and participant_id,
mapping IDs in the expression files to IDs in the
genotype (these can be the same).
--molecular-trait-type gene
Options: gene, protein, atac
--coordinate-annotation VAL (as path, required)
gtf annotation for RNA-seq data; other types of
annotation index for protein and atac-seq
--phenotype-id-column 'gene_id'
gene_id or gene_name for RNA-seq data, SOMAseqID for
proteins for example
--auxiliary-id-mapping . (as path)
Optional mapping file (protein_name_index for protein,
etc.)
--[no-]strip-id (default to False)
Optional processing flag
annotate_coord_biomart:
Workflow Options:
--ensembl-version VAL (as int, required)
map_leafcutter_cluster_to_gene:
Workflow Options:
--intron-count VAL (as path, required)
Extract the code in case psichromatic needs to be
processed the same way PheoFile in this step is the
intron_count file
--map-stra site
Defines the mapping strategy options: 'site' or
'region', with 'site' as the default. The 'site'
strategy maps introns to the start and end of each exon.
The 'region' strategy, to be recommended in leafcutter2,
maps each intron based on it overlapping more than
overlap_ratio of a gene's region.
--overlap-ratio 0.8 (as float)
Define the overlap ratio as the proportion of the
cluster length that intersects with a gene, used to
determine mapping to the gene.
annotate_leafcutter_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
annotate_psichomics_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
Setup and global parameters#
[global]
parameter: modular_script_dir = path('code/script') # override with --modular-script-dir
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
parameter: entrypoint= ""
Implementation using pyqtl#
Implementation based on GTEx pipeline.
Following step serves to annotate coord for gene expression file.
[annotate_coord]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
# Options: gene, protein, atac
parameter: molecular_trait_type = "gene"
# gtf annotation for RNA-seq data; other types of annotation index for protein and atac-seq
parameter: coordinate_annotation = path
# gene_id or gene_name for RNA-seq data, SOMAseqID for proteins for example
parameter: phenotype_id_column = "gene_id"
# Optional mapping file (protein_name_index for protein, etc.)
parameter: auxiliary_id_mapping = path()
# Optional processing flag
parameter: strip_id = False
parameter: sep = "\t"
input: phenoFile, coordinate_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
--step annotate_coord \
--phenoFile "${_input[0]}" \
--coordinate-annotation "${_input[1]}" \
--sample-participant-lookup "${sample_participant_lookup}" \
--molecular-trait-type "${molecular_trait_type}" \
--phenotype-id-column "${phenotype_id_column}" \
--auxiliary-id-mapping "${auxiliary_id_mapping}" \
--sep "${sep}" \
${"--strip-id" if strip_id else ""} \
--output-bed "${_output[0]}" \
--output-region-list "${_output[1]}" \
--gene-list-output "${cwd:a}/${_input[0]:bn}.gene_list.tsv"
Implementation using biomaRt#
This workflow adds the annotations of chr pos(TSS where start = end -1) and gene_ID to the bed file. This workflow is obsolete.
[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
Rscript ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.R \
--step annotate_coord_biomart \
--phenoFile "${_input[0]}" \
--ensembl-version ${ensembl_version} \
--output-bed "${_output[0]}" \
--output-region-list "${_output[1]}"
Annotation of leafcutter isoform#
The gtf used here should be the collapsed gtf, i.e. the final output of reference_data gtf processing and the one used to called rnaseq.
[map_leafcutter_cluster_to_gene]
## Extract the code in case psichromatic needs to be processed the same way
## PheoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default.
# The 'site' strategy maps introns to the start and end of each exon.
# The 'region' strategy, to be recommended in leafcutter2, maps each intron based on it overlapping more than overlap_ratio of a gene's region.
parameter: map_stra = "site"
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
--step map_leafcutter_cluster_to_gene \
--intron-count "${_input[0]}" \
--annotation-gtf "${_input[1]}" \
--map-stra "${map_stra}" \
--overlap-ratio ${overlap_ratio} \
--output-exon-list "${_output[0]}" \
--output-cluster-map "${_output[1]}"
[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
--step annotate_leafcutter_isoforms \
--phenoFile "${_input[0]}" \
--annotation-gtf "${_input[1]}" \
--cluster-map "${_input[3]}" \
--sample-participant-lookup "${sample_participant_lookup}" \
--output-bed "${_output[0]}" \
--output-phenotype-group "${_output[1]}"
Processing of psichomics output#
It occurs that the psichomatic by default grouped the isoforms by gene name, so only thing needs to be done is to extract this information and potentially renamed the gene symbol into ENSG ID
Anticipated Results#
The pipeline produces output files in the output/ subdirectory named after the workflow step. Verify success by checking that output files exist and are non-empty. See the Output section above for the expected file names and formats.
[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint = entrypoint
python3 ${modular_script_dir}/data_preprocessing/phenotype/gene_annotation.py \
--step annotate_psichomics_isoforms \
--phenoFile "${_input[0]}" \
--annotation-gtf "${_input[1]}" \
--sample-participant-lookup "${sample_participant_lookup}" \
--output-bed "${_output[0]}" \
--output-phenotype-group "${_output[1]}"