Gene Coordinate Annotation#
Description#
We use a gene coordinate annotation pipeline based on pyqtl
, as demonstrated here. This adds genomic coordinate annotations to gene-level molecular phenotype files generated in gct
format and converts them to bed
format for downstreams analysis.
Alternative implementation#
Previously we use biomaRt
package in R instead of code from pyqtl
. The core function calls are:
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
ensembl_df <- getBM(attributes=c("ensembl_gene_id","chromosome_name", "start_position", "end_position"),mart=ensembl)
We require ENSEMBL version to be specified explicitly in this pipeline. As of 2021 for the Brain xQTL project, we use ENSEMBL version 103.
Input#
Molecular phenotype data in
gct
format, with the first column being ENSEMBL ID and other columns being sample names.GTF for collapsed gene model
the gene names must be consistent with the molecular phenotype data matrices (eg ENSG00000000003 vs. ENSG00000000003.1 will not work)
(Optional) Meta-data to match between sample names in expression data and genotype files
Tab delimited with header
Only 2 columns: first column is sample name in expression data, 2nd column is sample name in genotype data
must contains all the sample name in expression matrices even if they don’t existing in genotype data
Output#
Molecular phenotype data in bed
format.
Minimal Working Example Steps#
The MWE is uploaded to Synapse
i. Cooridnate Annotation#
a. Gene Expression#
Timing: <1 min
sos run pipeline/gene_annotation.ipynb annotate_coord \
--cwd output/rnaseq \
--phenoFile output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.gz \
--coordinate-annotation reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--phenotype-id-column gene_id
INFO: Running annotate_coord:
INFO: annotate_coord is completed.
INFO: annotate_coord output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/rnaseq/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.bed.region_list.txt
INFO: Workflow annotate_coord (ID=wc0f3b36281bafaba) is executed successfully with 1 completed step.
[fgrennjr@scc-ym2 toy_xqtl_protocol]$ cd output/rnaseq/
b. Proteins#
Timing: <1 min
!sos run gene_annotation.ipynb annotate_coord \
--cwd ../../../output_test/phenotype_by_chrom \
--phenoFile ../../../mwe_data/protocol_data/input/xqtl_association/protocol_example.protein.csv \
--coordinate-annotation ../../../reference_data/Homo_sapiens.GRCh38.103.chr.reformatted.collapse_only.gene.ERCC.gtf \
--sample-participant-lookup ../../../mwe_data/protocol_data/output/protocol_example.protein.sample_overlap.txt \
--phenotype-id-column gene_name \
--molecular-trait-type protein \
--sep ","
INFO: Running annotate_coord_protein:
INFO: tbecf36ea3e901ea1 re-execute completed
INFO: tbecf36ea3e901ea1 submitted to neurology with job id Your job 3961996 ("job_tbecf36ea3e901ea1") has been submitted
INFO: Waiting for the completion of 1 task.
INFO: Waiting for the completion of 1 task.
INFO: annotate_coord_protein output: /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.bed.gz /restricted/projectnb/xqtl/xqtl_protocol/output_test/phenotype_by_chrom/protocol_example.protein.region_list.txt
INFO: Workflow annotate_coord_protein (ID=w75d21004c6718efd) is executed successfully with 1 completed step and 1 completed task.
Troubleshooting#
Step |
Substep |
Problem |
Possible Reason |
Solution |
---|---|---|---|---|
Command Interface#
!sos run gene_annotation.ipynb -h
usage: sos run gene_annotation.ipynb
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
region_list_generation
annotate_coord
annotate_coord_biomart
map_leafcutter_cluster_to_gene
annotate_leafcutter_isoforms
annotate_psichomics_isoforms
Global Workflow Options:
--cwd output (as path)
Work directory & output directory
--phenoFile VAL (as path, required)
Molecular phenotype matrix
--job-size 1 (as int)
For cluster jobs, number commands to run per job
--walltime 5h
Wall clock time expected
--mem 16G
Memory expected
--numThreads 1 (as int)
Number of threads
--container ''
--entrypoint ''
Sections
region_list_generation:
annotate_coord:
Workflow Options:
--sample-participant-lookup . (as path)
A file to map sample ID from expression to genotype,
must contain two columns, sample_id and participant_id,
mapping IDs in the expression files to IDs in the
genotype (these can be the same).
--molecular-trait-type gene
Options: gene, protein, atac
--coordinate-annotation VAL (as path, required)
gtf annotation for RNA-seq data; other types of
annotation index for protein and atac-seq
--phenotype-id-column 'gene_id'
gene_id or gene_name for RNA-seq data, SOMAseqID for
proteins for example
--auxiliary-id-mapping . (as path)
Optional mapping file (protein_name_index for protein,
etc.)
--[no-]strip-id (default to False)
Optional processing flag
annotate_coord_biomart:
Workflow Options:
--ensembl-version VAL (as int, required)
map_leafcutter_cluster_to_gene:
Workflow Options:
--intron-count VAL (as path, required)
Extract the code in case psichromatic needs to be
processed the same way PheoFile in this step is the
intron_count file
--map-stra site
Defines the mapping strategy options: 'site' or
'region', with 'site' as the default. The 'site'
strategy maps introns to the start and end of each exon.
The 'region' strategy, to be recommended in leafcutter2,
maps each intron based on it overlapping more than
overlap_ratio of a gene's region.
--overlap-ratio 0.8 (as float)
Define the overlap ratio as the proportion of the
cluster length that intersects with a gene, used to
determine mapping to the gene.
annotate_leafcutter_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
annotate_psichomics_isoforms:
Workflow Options:
--sample-participant-lookup . (as path)
Setup and global parameters#
[global]
# Work directory & output directory
parameter: cwd = path("output")
# Molecular phenotype matrix
parameter: phenoFile = path
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 1
parameter: container = ""
parameter: entrypoint= ""
Annotate coord#
Input:#
A gtf file used to generated the bed
A phenotype bed file, must have a gene_id column indicating the name of genes.
Output:#
annotated phenotype bed file
Optional: a tsv file only has the first several columns indicating phenotype_id and names. For RNA-seq, it contains five columns: chr,start,end,gene_id,gene_name
Implementation using pyqtl
#
Implementation based on GTEx pipeline.
Following step serves to annotate coord for gene expression file.
[annotate_coord]
# A file to map sample ID from expression to genotype, must contain two columns, sample_id and participant_id, mapping IDs in the expression files to IDs in the genotype (these can be the same).
parameter: sample_participant_lookup = path()
# Options: gene, protein, atac
parameter: molecular_trait_type = "gene"
# gtf annotation for RNA-seq data; other types of annotation index for protein and atac-seq
parameter: coordinate_annotation = path
# gene_id or gene_name for RNA-seq data, SOMAseqID for proteins for example
parameter: phenotype_id_column = "gene_id"
# Optional mapping file (protein_name_index for protein, etc.)
parameter: auxiliary_id_mapping = path()
# Optional processing flag
parameter: strip_id = False
parameter: sep = "\t"
input: phenoFile, coordinate_annotation
output: f'{cwd:a}/{_input[0]:bn}.bed.gz', f'{cwd:a}/{_input[0]:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'
import pandas as pd
import numpy as np
import qtl.io
from pathlib import Path
from collections import defaultdict
import warnings
# Function to convert gtf to bed
def gtf_to_bed(annotation_gtf, feature='gene', exclude_chrs=[], phenotype_id='gene_id'):
"""
Parse genes from GTF and return DataFrame with min start and max end positions.
For each gene, uses the smallest position as start and largest as end position.
"""
# Data collections
gene_data = defaultdict(lambda: {'chr': '', 'start': float('inf'), 'end': 0, 'strand': ''})
gene_names = {} # To store gene_name for each gene_id
if annotation_gtf.endswith('.gz'):
opener = gzip.open(annotation_gtf, 'rt')
else:
opener = open(annotation_gtf, 'r')
with opener as gtf:
for row in gtf:
row = row.strip().split('\t')
if row[0][0] == '#' or row[2] != feature:
continue # Skip header or non-matching features
# Parse attributes
attributes = defaultdict()
for a in row[8].replace('"', '').split(';')[:-1]:
kv = a.strip().split(' ')
if kv[0] != 'tag':
attributes[kv[0]] = kv[1]
else:
attributes.setdefault('tags', []).append(kv[1])
# Get gene identifiers
curr_gene_id = attributes['gene_id']
curr_gene_name = attributes['gene_name']
gene_names[curr_gene_id] = curr_gene_name
# Update gene data
data = gene_data[curr_gene_id]
if data['chr'] == '': # First entry for this gene
data['chr'] = row[0]
data['strand'] = row[6]
# Update min start and max end
start_pos = int(row[3]) - 1 # Convert to 0-based for BED
end_pos = int(row[4]) - 1 # Both start and end are used
data['start'] = min(data['start'], start_pos)
data['end'] = max(data['end'], end_pos)
# Convert to dataframe format
chrom, start, end, ids, names, strand = [], [], [], [], [], []
for gene_id, data in gene_data.items():
chrom.append(data['chr'])
start.append(data['start'])
end.append(data['end'])
ids.append(gene_id)
names.append(gene_names[gene_id])
strand.append(data['strand'])
# Create DataFrame based on phenotype_id
if phenotype_id == 'gene_id':
bed_df = pd.DataFrame({
'chr': chrom,
'start': start,
'end': end,
'gene_id': ids,
'strand': strand
}, columns=['chr', 'start', 'end', 'gene_id', 'strand'], index=ids)
elif phenotype_id == 'gene_name':
bed_df = pd.DataFrame({
'chr': chrom,
'start': start,
'end': end,
'gene_id': names,
'strand': strand
}, columns=['chr', 'start', 'end', 'gene_id', 'strand'], index=names)
# Filter out excluded chromosomes
mask = np.ones(len(chrom), dtype=bool)
for k in exclude_chrs:
mask = mask & (bed_df['chr'] != k)
bed_df = bed_df[mask]
# Sort by chromosome and start position
bed_df = bed_df.groupby('chr', sort=False, group_keys=False).apply(lambda x: x.sort_values(['start', 'end']))
return bed_df
def prepare_bed(df, bed_template_df, chr_subset=None):
bed_df = pd.merge(bed_template_df, df, left_index=True, right_index=True)
bed_df = bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values(['start','end']))
if chr_subset is not None:
bed_df = bed_df[bed_df.chr.isin(chr_subset)]
return bed_df
def load_and_preprocess_data(input_path, drop_columns):
df = pd.read_csv(input_path, sep="${sep}", skiprows=0)
dc = [col for col in df.columns if col in drop_columns] # Take intersection between df.columns and drop_columns
df = df.drop(dc, axis=1) # drop the intersect
if len(df.columns) < 2:
raise ValueError("There are too few columns in the loaded dataframe, please check the delimiter of the input file. The default delimiter is tab")
return df
def rename_samples_using_lookup(df, lookup_path):
lookup_file = Path(lookup_path)
if not lookup_file.is_file():
return df
if lookup_file.suffix == '.csv':
sep = ','
elif lookup_file.suffix == '.tsv':
sep = '\t'
else:
sep = '\t' # Default to tab if unknown extension
try:
sample_participant_lookup = pd.read_csv(lookup_file, sep=sep, dtype={0:str, 1:str})
if "genotype_id" in sample_participant_lookup.columns:
sample_participant_lookup = sample_participant_lookup.set_index(sample_participant_lookup.columns[1])
df.rename(columns=sample_participant_lookup.to_dict()["genotype_id"], inplace=True)
else:
# Fall back to the no header style
sample_participant_lookup = pd.read_csv(lookup_file, sep=sep, header=None, dtype={0:str, 1:str})
rename_dict = dict(zip(sample_participant_lookup[0], sample_participant_lookup[1]))
df.rename(columns=rename_dict, inplace=True)
except Exception as e:
print(f"Warning: Error processing sample lookup file: {e}")
return df
def load_bed_template_from_gtf(input_path, phenotype_id_column):
if sum(gtf_to_bed(input_path, feature='gene', phenotype_id="gene_id").index.duplicated()) > 0:
raise ValueError(f"gtf file {input_path} needs to be collapsed into gene model by reference data processing module")
bed_template_df_id = gtf_to_bed(input_path, feature='transcript', phenotype_id="gene_id")
bed_template_df_name = gtf_to_bed(input_path, feature='transcript', phenotype_id="gene_name")
bed_template_df = bed_template_df_id.merge(bed_template_df_name, on=["chr", "start", "end", "strand"])
bed_template_df.columns = ["#chr", "start", "end", "gene_id", "strand", "gene_name"]
bed_template_df = bed_template_df.set_index(phenotype_id_column, drop=False)
return bed_template_df
drop_cols = ["#chr", "chr", "start", "end", "stop", "annot.seqnames", "annot.start", "annot.end"]
df = load_and_preprocess_data(${_input[0]:ar}, drop_cols)
molecular_trait_type = "${molecular_trait_type}"
df.set_index(df.columns[0], inplace=True)
if ${True if sample_participant_lookup.is_file() else False}:
df = rename_samples_using_lookup(df, "${sample_participant_lookup:a}")
# First column is always the phenotype ID column
phenotype_id = df.columns.values[0]
# Type-specific processing
if molecular_trait_type == "gene":
if ${strip_id}:
# Update the index by stripping the pattern "A | B" to "B"
df.index = df.index.map(lambda x: x.split('|')[-1].strip() if '|' in x else x)
bed_template_df = load_bed_template_from_gtf(${_input[1]:ar}, "${phenotype_id_column}")
bed_df = prepare_bed(df, bed_template_df).drop("gene_name", axis=1)
bed_df = bed_df.drop_duplicates("gene_id", keep=False)
bed_df = bed_df.rename(columns={'gene_id': 'ID'})
elif molecular_trait_type == "protein":
auxiliary_mapping = Path("${auxiliary_id_mapping:a}")
if auxiliary_mapping.is_file():
# Use protein name mapping from auxiliary file
df_info = pd.read_csv(auxiliary_mapping).rename(
columns={phenotype_id: phenotype_id, 'EntrezGeneSymbol': 'gene_name'}
)[['gene_name', phenotype_id, 'UniProt']]
df = df_info.merge(df, on=phenotype_id).drop(phenotype_id, axis=1)
else:
# Extract UniProt ID from phenotype ID if no mapping file
df[[phenotype_id, 'UniProt']] = df[phenotype_id].astype(str).str.split('|', 1, expand=True)
bed_template_df = load_bed_template_from_gtf(${_input[1]:ar}, "${phenotype_id_column}")
bed_df = prepare_bed(df, bed_template_df)
bed_df["ID"] = bed_df["gene_id"] + "_" + bed_df["UniProt"]
bed_df = bed_df.drop_duplicates("ID", keep=False)[["#chr", "start", "end", "ID"] + df.drop(["UniProt"], axis=1).columns.values.tolist()]
elif molecular_trait_type == "atac":
# For ATAC, the coordinate_annotation is directly a BED file
bed_template_df = pd.read_csv("${_input[1]}", sep="\t").set_index("ID")
bed_template_df = bed_template_df.assign(ID=bed_template_df.index)
bed_df = prepare_bed(df, bed_template_df)
bed_df = bed_df.drop_duplicates("ID", keep=False)
else:
raise ValueError(f"Unsupported molecular_trait_type: {molecular_trait_type}")
qtl.io.write_bed(bed_df, ${_output[0]:r})
bed_df[["#chr", "start", "end", "ID", "strand"]].assign(path=${_output[0]:r}).to_csv(${_output[1]:r}, "\t", index=False)
# If annotating gene or protein, save a gene_list.tsv for gene partitioning
if molecular_trait_type in ["gene", "protein"]:
# Save the region list
region_list = bed_template_df[bed_template_df.${phenotype_id_column}.isin(df.index)]
region_list.to_csv('${cwd:a}/${_input[0]:bn}.gene_list.tsv', sep="\t", index=False)
else:
# For ATAC, we cannot partition it by gene.
warnings.warn(f"Gene partitioning is not applicable for molecular trait type: {molecular_trait_type}. No gene_list.tsv will be generated.")
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
Implementation using biomaRt#
This workflow adds the annotations of chr pos(TSS where start = end -1) and gene_ID to the bed
file. This workflow is obsolete.
[annotate_coord_biomart]
parameter: ensembl_version=int
input: phenoFile
output: f'{cwd:a}/{_input:bn}.bed.gz',
f'{cwd:a}/{_input:bn}.region_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
R: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout' ,container = container, entrypoint=entrypoint
library(biomaRt)
library(dplyr)
library(readr)
# Clear biomart cache
biomartCacheClear()
# Read gene expression data
gene_exp <- read_delim("$[_input[0]]", delim = "\t")
# If column "#chr" exists, remove the first 3 columns (chr, start, and end)
if("#chr" %in% colnames(gene_exp)){
gene_exp <- gene_exp[, 4:ncol(gene_exp)]
}
# Connect to Ensembl biomart
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", version = "$[ensembl_version]")
# Get gene annotations from Ensembl
ensembl_df <- getBM(attributes = c("ensembl_gene_id", "chromosome_name", "start_position", "end_position"), mart = ensembl)
# Filter and rename columns
my_genes_ann <- ensembl_df %>%
filter(ensembl_gene_id %in% gene_exp$gene_ID, chromosome_name %in% 1:23) %>%
rename("#chr" = chromosome_name,
"start" = start_position,
"end" = end_position,
"gene_ID" = ensembl_gene_id) %>%
filter(gene_ID != "NA")
# Save the annotations
my_genes_ann %>%
select(`#chr`, start, end, gene_ID) %>%
write_delim(path = "$[_output[1]]", "\t")
# Merge the annotations with the gene expression data, after modifying 'end' to be 'start + 1'
my_gene_bed <- my_genes_ann %>%
mutate(end = start + 1) %>%
inner_join(gene_exp, by = "gene_ID") %>%
arrange(`#chr`, start)
# Save the final merged data
write_tsv(my_gene_bed, path = "$[_output[0]:n]", na = "NA", append = FALSE, col_names = TRUE, quote_escape = "double")
bash: expand = "$[ ]", stderr = f'{_output[0]}.stderr', stdout = f'{_output[0]}.stdout',container = container, entrypoint=entrypoint
bgzip -f $[_output[0]:n]
tabix -p bed $[_output[0]] -f
Annotation of leafcutter isoform#
The following steps processed the output files of leafcutter so that they are TensorQTL ready. Shown below are three intemediate files
Exon list
chr |
start |
end |
strand |
gene_id |
gene_name |
---|---|---|---|---|---|
chr1 |
29554 |
30039 |
+ |
ENSG00000243485 |
MIR1302-2HG |
chr1 |
30564 |
30667 |
+ |
ENSG00000243485 |
MIR1302-2HG |
chr1 |
30976 |
31097 |
+ |
ENSG00000243485 |
MIR1302-2HG |
chr1 |
35721 |
36081 |
- |
ENSG00000237613 |
FAM138A |
chr1 |
35277 |
35481 |
- |
ENSG00000237613 |
FAM138A |
chr1 |
34554 |
35174 |
- |
ENSG00000237613 |
FAM138A |
chr1 |
65419 |
65433 |
+ |
ENSG00000186092 |
OR4F5 |
chr1 |
65520 |
65573 |
+ |
ENSG00000186092 |
OR4F5 |
chr1 |
69037 |
71585 |
+ |
ENSG00000186092 |
OR4F5 |
clusters_to_genes
clu |
genes |
---|---|
1:clu_1_+ |
ENSG00000116288 |
1:clu_10_+ |
ENSG00000143774 |
1:clu_11_+ |
ENSG00000143774 |
1:clu_12_+ |
ENSG00000143774 |
1:clu_14_- |
ENSG00000126709 |
1:clu_15_- |
ENSG00000121753 |
1:clu_16_- |
ENSG00000121753 |
1:clu_17_- |
ENSG00000116560 |
1:clu_18_- |
ENSG00000143549 |
phenotype_group
X1 |
X2 |
---|---|
7:102476270:102478811:clu_309_-:ENSG00000005075 |
ENSG00000005075 |
7:102476270:102478808:clu_309_-:ENSG00000005075 |
ENSG00000005075 |
X:47572961:47574002:clu_349_-:ENSG00000008056 |
ENSG00000008056 |
X:47572999:47574002:clu_349_-:ENSG00000008056 |
ENSG00000008056 |
8:27236905:27239971:clu_322_-:ENSG00000015592 |
ENSG00000015592 |
8:27239279:27239971:clu_322_-:ENSG00000015592 |
ENSG00000015592 |
8:27241262:27241677:clu_323_-:ENSG00000015592 |
ENSG00000015592 |
8:27241262:27242397:clu_323_-:ENSG00000015592 |
ENSG00000015592 |
8:27241757:27242397:clu_323_-:ENSG00000015592 |
ENSG00000015592 |
1:35558223:35559107:clu_4_+:ENSG00000020129 |
ENSG00000020129 |
The gtf used here should be the collapsed gtf, i.e. the final output of reference_data gtf processing and the one used to called rnaseq.
[map_leafcutter_cluster_to_gene]
## Extract the code in case psichromatic needs to be processed the same way
## PheoFile in this step is the intron_count file
parameter: intron_count = path
# Defines the mapping strategy options: 'site' or 'region', with 'site' as the default.
# The 'site' strategy maps introns to the start and end of each exon.
# The 'region' strategy, to be recommended in leafcutter2, maps each intron based on it overlapping more than overlap_ratio of a gene's region.
parameter: map_stra = "site"
# Define the overlap ratio as the proportion of the cluster length that intersects with a gene, used to determine mapping to the gene.
parameter: overlap_ratio = 0.8
input: intron_count, annotation_gtf
output: f'{cwd}/{_input[0]:b}.exon_list', f'{cwd}/{_input[0]:b}.leafcutter.clusters_to_genes.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import qtl.annotation
gtf = ${_input[1]:r}
#there is no "gene_type" when using the uncollapsed gtf. Replace "gene_biotype" with "gene_type" in the gtf and temporarily save for use in the annotation
has_gene_type = True
with open(gtf, 'r') as file:
for line in file:
if line[0] == '#':
continue
row = line.strip().split('\t')
chrom = row[0]
# source = row[1]
annot_type = row[2]
if annot_type == "gene":
if "gene_type" not in line:
has_gene_type = False
break
if has_gene_type == False:
with open(gtf, 'r') as file:
file_contents = file.read()
updated_contents = file_contents.replace("gene_biotype", "gene_type")
gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
with open(gtf, 'w') as file:
file.write(updated_contents)
# Load data
#annot = qtl.annotation.Annotation(${_input[1]:r})
annot = qtl.annotation.Annotation(gtf)
exon_df = pd.DataFrame([[g.chr, e.start_pos, e.end_pos, g.strand, g.id, g.name]
for g in annot.genes for e in g.transcripts[0].exons],
columns=['chr', 'start', 'end', 'strand', 'gene_id', 'gene_name'])
exon_df.to_csv(${_output[0]:r}, sep='\t', index=False)
R:expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
suppressMessages(library(dplyr, quietly=TRUE))
suppressMessages(library(stringr, quietly=TRUE))
suppressMessages(library(foreach, quietly=TRUE))
# leafcutter functions:
#' Make a data.frame of meta data about the introns
#' @param introns Names of the introns
#' @return Data.frame with chr, start, end, cluster id
#' @export
get_intron_meta <- function(introns) {
intron_meta <- do.call(rbind, strsplit(introns, ":"))
if (ncol(intron_meta) == 5) {
colnames(intron_meta) <- c("chr", "start", "end", "clu", "category")
} else {
colnames(intron_meta) <- c("chr", "start", "end", "clu")
}
intron_meta <- as.data.frame(intron_meta, stringsAsFactors = FALSE)
intron_meta$start <- as.numeric(intron_meta$start)
intron_meta$end <- as.numeric(intron_meta$end)
return(intron_meta)
}
#' Work out which gene each cluster belongs to. Note the chromosome names used in the two inputs must match.
#' @param intron_meta Data frame describing the introns, usually from get_intron_meta
#' @param exons_table Table of exons, see e.g. /data/gencode19_exons.txt.gz
#' @return Data.frame with cluster ids and genes separated by commas
#' @import dplyr
#' @export
map_clusters_to_genes_site <- function(intron_meta, exons_table) {
suppressMessages(library(foreach, quietly=TRUE))
gene_df <- foreach (chr=sort(unique(intron_meta$chr)), .combine=rbind) %dopar% {
intron_chr <- intron_meta[ intron_meta$chr==chr, ]
exons_chr <- exons_table[exons_table$chr==chr, ]
exons_chr$temp <- exons_chr$start
intron_chr$temp <- intron_chr$end
three_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
exons_chr$temp <- exons_chr$end
intron_chr$temp <- intron_chr$start
five_prime_matches <- inner_join( intron_chr, exons_chr, by="temp")
all_matches <- rbind(three_prime_matches, five_prime_matches)[ , c("clu", "gene_name")]
all_matches <- all_matches[!duplicated(all_matches),]
if (nrow(all_matches)==0) return(NULL)
all_matches$clu <- paste(chr,all_matches$clu,sep=':')
all_matches
}
clu_df <- gene_df %>% group_by(clu) %>% summarize(genes=paste(gene_name, collapse = ","))
class(clu_df) <- "data.frame"
clu_df
}
map_clusters_to_genes_region <- function(intron_meta, exons_table, f) {
#combine the exon position to get gene region
merged_df <- exons_table %>%
group_by(chr, gene_id) %>%
summarize(start = min(start), end = max(end), .groups = 'drop') %>% as.data.frame()%>%
.[,c("chr","start","end","gene_id")]
# use bedtools to map the intron region to gene region
options(bedtools.path = "/home/rf2872/software/bedtools2/bin/") #Fixme: build bedtools in leafcutter sif
map_res <- bedtoolsr::bt.intersect(a = intron_meta, b = merged_df,f = f, wa = TRUE, wb=TRUE)
#organize output of mapped file
exon_map <- map_res %>%
mutate(clu = paste(V1, V4, sep=":"),
genes = .[[paste0("V", 4 + ncol(intron_meta))]]) %>%
select(clu,genes)
return(exon_map)
}
cat("LeafCutter: mapping clusters to genes\n")
intron_counts <- read.table(${_input[0]:r}, header=TRUE, check.names=FALSE, row.names=4)
intron_meta <- get_intron_meta(rownames(intron_counts))
exon_table <- read.table(${_output[0]:r}, header=TRUE, stringsAsFactors=FALSE)
if(!str_detect(intron_meta$chr[1],"chr")) {
exon_table = exon_table %>% mutate(chr = str_remove_all(chr,"chr"))
} else if (!any(str_detect(exon_table$chr[1],"chr"))) {
exon_table = exon_table %>% mutate(chr = paste0("chr",chr))
} else {exon_table = exon_table}
stopifnot(is.element('gene_id', colnames(exon_table)))
exon_table[, 'gene_name'] <- exon_table[, 'gene_id']
if("${map_stra}" == "site"){
cat("mapping clusters to genes by splicing site\n")
m <- map_clusters_to_genes_site(intron_meta, exon_table)
} else if("${map_stra}" == "region"){
cat("mapping clusters to genes by overlapping gene region\n")
m <- map_clusters_to_genes_region(intron_meta, exon_table, f = ${overlap_ratio})
} else {
stop("Map strategy should only be 'site' or 'region'.")
}
write.table(m, ${_output[1]:r}, sep = "\t", quote=FALSE, row.names=FALSE)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
[annotate_leafcutter_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf,output_from("map_leafcutter_cluster_to_gene")
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import numpy as np
import qtl.io
import os
from pathlib import Path
# Load data
if os.path.exists(${_input[1]:rn}.tmp${_input[1]:rx}):
gtf = ${_input[1]:rn}.tmp${_input[1]:rx}
else:
gtf = ${_input[1]:r}
#tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r})
tss_df = qtl.io.gtf_to_tss_bed(gtf)
bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
bed_df.columns.values[0] = "#chr" # Temporary
sample_participant_lookup = Path("${sample_participant_lookup:a}")
cluster2gene_dict = pd.read_csv(${_input[3]:r}, sep='\t', index_col=0).to_dict()
cluster2gene_dict = cluster2gene_dict['genes']
print(' ** assigning introns to gene mapping(s)')
n = 0
gene_bed_df = []
group_s = {}
for _,r in bed_df.iterrows():
s = r['ID'].split(':')
cluster_id = s[0]+':'+s[3]
if cluster_id in cluster2gene_dict:
gene_ids = cluster2gene_dict[cluster_id].split(',')
for g in gene_ids:
gi = r['ID']+':'+g
gene_bed_df.append(tss_df.loc[g, ['chr', 'start', 'end']].tolist() + [gi] + r.iloc[4:].tolist())
group_s[gi] = g
else:
n += 1
if n > 0:
print(f' ** discarded {n} introns without a gene mapping')
print(' * writing BED files for QTL mapping')
gene_bed_df = pd.DataFrame(gene_bed_df, columns=bed_df.columns)
# sort by TSS
gene_bed_df = gene_bed_df.groupby('#chr', sort=False, group_keys=False).apply(lambda x: x.sort_values('start'))
#rename the samples if they named by file name (simply pick the first element with [.])
gene_bed_df.columns = list(gene_bed_df.columns[:4]) + [name.split('.')[0] if 'junc' in name else name for name in gene_bed_df.columns[4:]]
# change sample IDs to participant IDs
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
#gene_bed_df.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
# Create a dictionary mapping from sample_id to participant_id
column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
# Get the column names to be replaced
column_names = gene_bed_df.columns[4:]
# Replace the column names using the mapping dictionary
#new_column_names = [column_mapping.get(col, 'missing_data') for col in column_names]
#it should overlap with genotype in downstream anyways
new_column_names = [column_mapping.get(col, col) for col in column_names]
gene_bed_df.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)
gene_bed_df = gene_bed_df.drop_duplicates()
qtl.io.write_bed(gene_bed_df, ${_output[0]:r})
gene_bed_df[['start', 'end']] = gene_bed_df[['start', 'end']].astype(np.int32)
gene_bed_df[gene_bed_df.columns[4:]] = gene_bed_df[gene_bed_df.columns[4:]].astype(np.float32)
group_s_df = pd.Series(group_s).sort_values().reset_index()
group_s_df.columns = ['ID', 'gene']
group_s_df.to_csv(${_output[1]:r}, sep='\t', index=False, header=True)
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
rm $[_input[1]:rn].tmp$[_input[1]:rx]
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
Processing of psichomics output#
It occurs that the psichomatic by default grouped the isoforms by gene name, so only thing needs to be done is to extract this information and potentially renamed the gene symbol into ENSG ID
[annotate_psichomics_isoforms]
parameter: sample_participant_lookup = path()
input: phenoFile, annotation_gtf
output: f'{cwd:a}/{_input[0]:bn}.formated.bed.gz', f'{cwd:a}/{_input[0]:bn}.phenotype_group.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, tags = f'{step_name}_{_output[0]:bn}'
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
import pandas as pd
import numpy as np
import qtl.io
from pathlib import Path
tss_df = qtl.io.gtf_to_tss_bed(${_input[1]:r}, feature='gene',phenotype_id = "gene_id" )
bed_df = pd.read_csv(${_input[0]:ar}, sep='\t', skiprows=0)
bed_df["gene_id"] = [x[-1] for x in bed_df.ID.str.split("_")]
sample_participant_lookup = Path("${sample_participant_lookup:a}")
if "start" in bed_df.columns:
bed_df = bed_df.drop(["#Chr","start","end"],axis = 1)
output = tss_df.merge(bed_df, left_on = ["gene_id"], right_on = ["gene_id"],how = "right").sort_values(["chr","start"])
# change sample IDs to participant IDs
if sample_participant_lookup.is_file():
sample_participant_lookup_s = pd.read_csv(sample_participant_lookup, sep="\t", index_col=0, dtype={0:str,1:str})
column_mapping = dict(zip(sample_participant_lookup_s.index, sample_participant_lookup_s['participant_id']))
column_names = output.columns[4:]
print(column_names)
new_column_names = [column_mapping.get(col, col) for col in column_names]
output.rename(columns=dict(zip(column_names, new_column_names)), inplace=True)
#output.rename(columns=sample_participant_lookup_s.to_dict(), inplace=True)
# Old code grouping by each gene
#bed_output = output.drop("gene_id" , axis = 1)
#phenotype_group = output[["ID","gene_id"]]
bed_output = output.drop("gene_id" , axis = 1)
# R version
#output = output%>%mutate(gene_type_id = map_chr(str_split( ID ,"_"),~paste0(.x[length(.x)],"_",.x[1] )) )
# corrected in python
output = output.assign(gene_type_id = output['ID'].str.split('_').map(lambda x: x[-1] + '_' + x[0]))
phenotype_group = output[["ID","gene_type_id"]]
bed_output.to_csv(${_output[0]:nr},"\t",index = False)
phenotype_group.to_csv(${_output[1]:r},"\t",index = False,header=False)
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
bgzip -f ${_output[0]:n}
tabix ${_output[0]}
bash: expand= "$[ ]", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container = container, entrypoint=entrypoint
stdout=$[_output[0]:n].stdout
for i in $[_output[0]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `zcat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `zcat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `zcat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
zcat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done
for i in $[_output[1]] ; do
echo "output_info: $i " >> $stdout;
echo "output_size:" `ls -lh $i | cut -f 5 -d " "` >> $stdout;
echo "output_rows:" `cat $i | wc -l | cut -f 1 -d " "` >> $stdout;
echo "output_headerow:" `cat $i | grep "##" | wc -l ` >> $stdout;
echo "output_column:" `cat $i | grep -V "##" | head -1 | wc -w ` >> $stdout;
echo "output_preview:" >> $stdout;
cat $i | grep -v "##" | head | cut -f 1,2,3,4,5,6 >> $stdout ; done