Splicing QC and Normalization#
Description#
Quality control and normalization of alternative-splicing quantifications produced by the leafcutter and psichomics tools. Raw outputs are converted to BED format, features with high per-sample missingness (default 40%) are dropped, low-variance introns/events (default min variance 0.001) are removed, values are quantile-normalized, and remaining missing values are mean-imputed. The normalized tables are ready to use as molecular phenotypes for tensorQTL.
Timing: ~2-5 min on typical compute infrastructure.
Input Files#
File |
Description |
|---|---|
|
leafcutter per-cluster intron usage counts ( |
|
A numeric, QC-ready leafcutter intron-ratio BED table; input to the standalone |
|
psichomics PSI matrix keyed by splicing-event IDs; input to |
Output Files#
File |
Description |
|---|---|
|
Quantile-normalized leafcutter intron-ratio phenotype table. |
|
psichomics PSI matrix parsed into BED format. |
|
Quantile-normalized psichomics PSI phenotype table. |
Each workflow produces a quantile-normalized phenotype table in BED-like format (#Chr start end ID plus one column per sample), ready to use directly as a molecular phenotype input to TensorQTL.
Steps#
0. Generate sample list#
RNA-seq sample IDs differ from the WGS sample IDs, so the joint-call sample lookup table is first subset to the RNA-seq samples used in the sQTL analysis, matching the sample set used elsewhere in the protocol.
Step input: ROSMAP_JointCall_sample_participant_lookup_fixed — a sample comparison table for ROSMAP datasets.
Step output: ROSMAP_JointCall_sample_participant_lookup_fixed.rnaseq — the same lookup table restricted to the RNA-seq sample set.
leafcutter_norm: QC, impute, and normalize leafcutter intron ratios#
This workflow runs the full leafcutter chain in one command: it builds the phenotype table, applies an autosome filter and QC, and quantile-normalizes the intron-usage ratios. The --mean_impute option fills missing ratios with the per-intron (row) mean. See the leafcutter documentation; the choice of regtool parameters is discussed here.
Parameters. chr_blacklist is a file of blacklisted chromosomes to exclude from analysis, one per line; if none is provided, no chromosomes are excluded.
Note. leafcutter_norm_1 typically requires about 10 GB of memory (more for large inputs); insufficient memory can cause a segmentation fault.
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
--cwd output/splicing \
--ratios input/rnaseq/protocol_example.leafcutter.intron_usage_perind.counts.gz \
--mean_impute
leafcutter_qqnorm: quantile-normalize an already-QCed leafcutter table#
Use this standalone workflow when the intron-ratio table has already been QCed and imputed and only quantile normalization is needed.
sos run pipeline/splicing_normalization.ipynb leafcutter_qqnorm \
--cwd output/splicing \
--qced-data input/rnaseq/protocol_example.leafcutter.intron_usage_perind.counts.gz_raw_data.txt
psichomics_norm: parse, QC, and normalize psichomics PSI values#
This workflow parses psichomics splicing-event IDs into BED coordinates, applies NA and variance filters, and then quantile-normalizes each event. See the psichomics documentation.
QC. The only QC applied to PSI values here is NA removal and a minimal variance filter; the psichomics maintainers suggest some further QC, discussed here. For reference, the default minimal variance used in leafcutter QC is 0.001.
Because the alternative-splicing event types in psichomics outputs differ, normalization is performed per event (rows) rather than per sample (columns).
sos run pipeline/splicing_normalization.ipynb psichomics_norm \
--cwd output/splicing \
--ratios input/rnaseq/protocol_example.psichomics.psi_raw_data.tsv
Complete worked example#
To perform leafcutter analysis with QC (setting cluster counts of 0 as NA), imputation (flashR), and normalization, use the following commands (RECOMMENDED):
Step 1: QC#
Input: intron usage count matrix from LeafCutter clustering (*_intron_usage_perind.counts.gz), produced upstream by the splicing calling step.
Output: a QC’d count matrix in which cluster counts equal to 0 are set to NA; normalization is skipped at this stage via --no_norm.
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
--cwd ../../output_test/leafcutter/normalize \
--ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
--no_norm # add no norm to skip last step (qqnorm) in leafcutter_norm
Step 2: Imputation#
Input: the QC’d count matrix from Step 1, supplied through --phenoFile.
Output: an imputed phenotype matrix with missing values filled by the EBMF (flashR) model.
sos run pipeline/phenotype_imputation.ipynb EBMF \
--phenoFile <output_from_step1> \
--cwd ../../output_test/leafcutter/imputation \
--prior ebnm_point_laplace --varType 1 \
--container oras://ghcr.io/cumc/factor_analysis_apptainer:latest \
--mem 40G --numThreads 20 --walltime 100h
Step 3: Normalization#
Input: the imputed phenotype matrix from Step 2, supplied through --qced-data.
Output: a quantile-normalized, BED-ready table (*.qqnorm.txt) ready for downstream QTL analysis.
sos run pipeline/splicing_normalization.ipynb leafcutter_qqnorm \
--qced-data <output_from_step2> \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest
Or if you are going to use the default mean imputation method, you can run it with one-step command (remove –no_norm in the first step):
Input: the raw intron usage count matrix (*_intron_usage_perind.counts.gz).
Output: a quantile-normalized table (*.qqnorm.txt) produced in a single pass; missing values are filled by simple mean imputation (--mean_impute).
sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
--cwd ../../output_test/leafcutter/normalize \
--ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
--container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
--mean_impute
Command interface#
sos run splicing_normalization.ipynb -h
Workflow implementation#
[global]
# The output directory for generated files.
parameter: cwd = path("output")
# intron usage ratio file wiht samples after QC
# optional parameter black list if user want to blacklist some chromosomes and not to analyze
parameter: chr_blacklist = path(".")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
[Junc_list]
parameter: junc_path = path
parameter: file_suffix = "junc"
parameter: leafcutter_version = "leafcutter2"
parameter: dataset = "ROSMAP_DLPFC"
output: f'{cwd}/{leafcutter_version}/{dataset}_intron_usage_perind.junc'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stderr'
cd ${cwd}
realpath ${junc_path}/*${file_suffix} > ${_output}.tmp
grep -v "all" ${_output}.tmp > ${_output}
rm ${_output}.tmp
[Jointcall_samples]
parameter: sample_table = path
parameter: junc_list = path
parameter: leafcutter_version = "leafcutter2"
input: sample_table, junc_list
output: f'{cwd}/{leafcutter_version}/{sample_table:b}.rnaseq'
R: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stderr'
library(data.table)
library(tidyverse)
sample_lookup_hao = read_delim(${_input[0]:r}, "\t" ,col_names = T)
sample_lookup_hao$sample_id<-sample_lookup_hao$sample_id%>%gsub(".final","",.)
sample_ids<-list.files(paste0("${cwd}","/","${leafcutter_version}","/"),"junc$")%>%stringr::str_split(.,"[.]",simplify=T)%>%.[,1]
junc_list<-read.table(${_input[1]:r})
sample_ids<-junc_list$V1%>%basename%>%stringr::str_split(.,"[.]",simplify=T)%>%.[,1]
sample_lookup_hao<-sample_lookup_hao[sample_lookup_hao$sample_id%in%sample_ids,]
write.table(sample_lookup_hao,${_output:r},quote = F,row.names = F,sep='\t')
[leafcutter_norm_1]
parameter: ratios = path
parameter: pseudo_ratio = False
import os
if os.path.isfile(f'{ratios:dd}/black_list.txt'):
chr_blacklist = f'{ratios:dd}/black_list.txt'
input: ratios, group_by = 'all'
output: f'{ratios}_phenotype_file_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
# code in [leafcutter_norm_1] and [leafcutter_norm_3] is modified from
# https://github.com/davidaknowles/leafcutter/blob/master/scripts/prepare_phenotype_table.py
import sys
import gzip
import os
import numpy as np
import pandas as pd
import scipy as sc
import pickle
from sklearn import linear_model
def stream_table(f, ss = ''):
fc = '#'
while fc[0] == "#":
fc = f.readline().strip()
head = fc.split(ss)
for ln in f:
ln = ln.strip().split(ss)
attr = {}
for i in range(len(head)):
try: attr[head[i]] = ln[i]
except: break
yield attr
def get_chromosomes(ratio_file):
"""Get chromosomes from table. Returns set of chromosome names"""
try: open(ratio_file)
except:
sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
return
sys.stderr.write("Parsing chromosome names...\n")
chromosomes = set()
with gzip.open(ratio_file, 'rt') as f:
f.readline()
for line in f:
chromosomes.add(line.split(":")[0])
return(chromosomes)
def get_blacklist_chromosomes(chromosome_blacklist_file):
"""
Get list of chromosomes to ignore from a file with one blacklisted
chromosome per line. Returns list. eg. ['X', 'Y', 'MT']
"""
if os.path.isfile(chromosome_blacklist_file):
with open(chromosome_blacklist_file, 'r') as f:
return(f.read().splitlines())
else:
return([])
def create_phenotype_table(ratio_file, chroms, blacklist_chroms):
dic_pop, fout = {}, {}
try: open(ratio_file)
except:
sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
return
sys.stderr.write("Starting...\n")
for i in chroms:
fout[i] = open(ratio_file+".phen_"+i, 'w')
fout_ave = open(ratio_file+".ave", 'w')
valRows, valRowsnn, geneRows = [], [], []
finished = False
header = gzip.open(ratio_file, 'rt').readline().split()[1:]
for i in fout:
fout[i].write("\t".join(["#chr","start", "end", "ID"]+header)+'\n')
for dic in stream_table(gzip.open(ratio_file, 'rt'),' '):
chrom = dic['chrom']
chr_ = chrom.split(":")[0]
if chr_ in blacklist_chroms: continue
NA_indices, aveReads = [], []
tmpvalRow = []
i = 0
for sample in header:
try: count = dic[sample]
except: print([chrom, len(dic)])
num, denom = count.split('/')
if float(denom) < 1:
count = "NA"
tmpvalRow.append("NA")
NA_indices.append(i)
else:
# add a 0.5 pseudocount
if ${pseudo_ratio}:
pseudocount = 0.01 * float(denom)
else:
pseudocount = 0.5
count = (float(num)+pseudocount)/((float(denom))+pseudocount)
tmpvalRow.append(count)
aveReads.append(count)
parts = chrom.split(":")
# Check the number of parts to determine how to unpack
if len(parts) == 5:
chr_, s, e, clu, category = parts #leafcutter2 would have 5 columns, the last column is function category
elif len(parts) == 4:
chr_, s, e, clu = parts #leafcutter would have 4 columns
category = None # or some default value, if necessary
else:
raise ValueError("Unexpected format of 'chrom' string")
if len(tmpvalRow) > 0:
fout[chr_].write("\t".join([chr_,s,e,chrom]+[str(x) for x in tmpvalRow])+'\n')
fout_ave.write(" ".join(["%s"%chrom]+[str(min(aveReads)), str(max(aveReads)), str(np.mean(aveReads))])+'\n')
valRows.append(tmpvalRow)
geneRows.append("\t".join([chr_,s,e,chrom]))
if len(geneRows) % 1000 == 0:
sys.stderr.write("Parsed %s introns...\n"%len(geneRows))
for i in fout:
fout[i].close()
matrix = np.array(valRows)
# write the corrected tables
sample_names = []
for name in header:
sample_names.append(name.replace('.Aligned.sortedByCoord.out.md', ''))
fout = {}
for i in chroms:
fn="%s.qqnorm_%s"%(ratio_file,i)
print("Outputting: " + fn)
fout[i] = open(fn, 'w')
fout[i].write("\t".join(['#Chr','start','end','ID'] + sample_names)+'\n')
lst = []
for i in range(len(matrix)):
chrom, s = geneRows[i].split()[:2]
lst.append((chrom, int(s), "\t".join([geneRows[i]] + [str(x) for x in matrix[i]])+'\n'))
lst.sort()
for ln in lst:
fout[ln[0]].write(ln[2])
fout_run = open("%s_phenotype_file_list.txt"%ratio_file, 'w')
fout_run.write("#chr\t#dir\n")
for i in fout:
fout[i].close()
fout_run.write("%s\t"%(i))
fout_run.write("%s.qqnorm_%s\n"%(ratio_file, i))
fout_run.close()
ratio_file = f'${_input}'
chroms = get_chromosomes(f'${_input}')
blacklist_chroms = get_blacklist_chromosomes(f'${chr_blacklist}')
create_phenotype_table(ratio_file, chroms, blacklist_chroms)
[leafcutter_norm_2]
parameter: autosomes = True
import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(f'{_input[0]}',sep = "\t")
# filter the autosomes or not
if autosomes:
print("Analyzing autosomes 1 to 22...")
mask = molecular_pheno_chr_inv['#chr'].str.match(r'chr(\d+)$')
chr_numbers = molecular_pheno_chr_inv['#chr'].str.extract(r'chr(\d+)', expand=False).dropna().astype(int)
molecular_pheno_chr_inv = molecular_pheno_chr_inv[mask & chr_numbers.between(1, 22)]
molecular_pheno_chr_inv = molecular_pheno_chr_inv.values.tolist()
file_inv = [x[1] for x in molecular_pheno_chr_inv]
input: file_inv # This design is necessary to avoid using for_each, as sos can not take chr number as an input.
output: f'{_input[0]:n}_raw_data.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
head -1 ${_input[0]:r} > ${_output}
cat ${_input:r} | grep -v "#Chr" >> ${_output}
[leafcutter_norm_3,leafcutter_qqnorm]
parameter: no_norm = False
stop_if(no_norm)
parameter: mean_impute = False
# minimal NA rate with in sample values for a possible alternative splicing event to be kept (default 0.4, chosen according to leafcutter default na rate)
parameter: na_rate = 0.4
# minimal variance across samples for a possible alternative splicing event to be kept (default 0.001, chosen according to psichomics suggested minimal variance)
parameter: min_variance = 0.001
# qced_data refers to data that has undergone leafcutter_norm1/2 normalization or has already been imputed.
# When this data is provided, mean imputation is automatically enabled (mean imputation set to true). Thus, the default imputation method is mean imputation.
# However, if the data provided is already imputed (i.e., contains no NA values), no further imputation will be applied.
parameter: qced_data = path(".")
parameter: bgzip=False
import os
if qced_data.is_dir():
print('Target Qced file with suffix "_raw_data.txt"')
input_file = paths([os.path.join(qced_data, x) for x in os.listdir(qced_data) if x.endswith('_raw_data.txt')])
bgzip=True
else:
print('Loading QCed data as inputed file')
input_file = paths(qced_data)
print('Setting mean_imput to True. If the input has already been imputed (i.e., no NA values are present), nothing will change. Otherwise, mean imputation will be applied to NA values.')
mean_impute = True
input: input_file
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
import numpy as np
import pandas as pd
from sklearn import preprocessing
from scipy.stats import norm
from scipy.stats import rankdata
def qqnorm(x):
n=len(x)
a=3.0/8.0 if n<=10 else 0.5
return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
#return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))
raw_df = pd.read_csv(f'${_input}',sep = "\t")
valRows = raw_df.iloc[:,4:]
headers = list(raw_df)
drop_list = []
na_limit = len(valRows.columns)*${na_rate}
for index, row in valRows.iterrows():
# If ratio is missing for over 40% of the samples, drop
if (row.isna().sum()) > na_limit:
drop_list.append(index)
# Set missing values as the mean of existed values in a row
if ${mean_impute}:
valRows.iloc[index] = row.fillna(row.mean())
# else:
# row.fillna(row.mean())
# drop introns with variance smaller than some minimal value
if np.nanstd(row) < ${min_variance}:
drop_list.append(index)
# save the intron information and sample values for remaining introns/rows
newtable = raw_df.drop(drop_list).iloc[:,0:4]
valRows = valRows.drop(drop_list)
# scale normalize on each row
valRows_matrix = []
for c in (valRows.values.tolist()):
c = preprocessing.scale(c)
valRows_matrix.append(c)
# qqnorms on the columns
matrix = np.array(valRows_matrix)
for i in range(len(matrix[0,:])):
matrix[:,i] = qqnorm(matrix[:,i])
normalized_table = pd.DataFrame(matrix)
# reset row index for the saved intron infomation so the index will match sample values
newtable = newtable.reset_index(drop=True)
# merge the two parts of table
output = pd.concat([newtable, normalized_table], axis=1)
output.columns = headers
# write normalized table
output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None)
bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
if ${'true' if bgzip else 'false'}; then
# Compress the sorted BED file
bgzip -c "${_output}" > "${_output:n}.bed.gz"
# Index the compressed BED file
tabix -p bed "${_output:n}.bed.gz"
fi
[psichomics_norm_1]
parameter: ratios = path
input: ratios, group_by = 'all'
output: f'{cwd}/psichomics_raw_data_bedded.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
library(psichomics)
library(data.table)
psi_data <- as.matrix(fread("${_input}"),rownames=1)
psi_data = as.data.frame(psi_data)
# Process PSI df into bed file for tensorQTL. (This part of code is modified from Ryan Yordanoff's work)
parsed_events <- parseSplicingEvent(row.names(psi_data))
# Create bedfile df and fill values with parsed values
bed_file <- data.frame("chr"=parsed_events$chrom,"start"=parsed_events$start,"end"=parsed_events$end,"ID"=row.names(parsed_events),psi_data,check.names = FALSE)
names(bed_file)[1] <- "#Chr"
bed_file$'#Chr' <- sub("^", "chr", bed_file$'#Chr')
row.names(bed_file) <- NULL
bed_file <- bed_file[order(bed_file$'#Chr',bed_file$start,bed_file$end),]
# Create BED file output
write.table(x=bed_file, file = "${cwd}/psichomics_raw_data_bedded.txt", quote = FALSE, row.names = FALSE, sep = "\t")
[psichomics_norm_2]
# minimal NA rate with in sample values for a possible alternative splicing event to be kept (default 0.4, chosen according to leafcutter default na rate)
parameter: na_rate = 0.4
# minimal variance across samples for a possible alternative splicing event to be kept (default 0.001, chosen according to psichomics suggested minimal variance)
parameter: min_variance = 0.001
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
import numpy as np
import pandas as pd
from sklearn import preprocessing
from scipy.stats import norm
from scipy.stats import rankdata
def qqnorm(x):
n=len(x)
a=3.0/8.0 if n<=10 else 0.5
return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
#return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))
raw_df = pd.read_csv(f'${_input}',sep = "\t")
valRows = raw_df.iloc[:,4:]
headers = list(raw_df)
drop_list = []
na_limit = len(valRows.columns)*${na_rate}
for index, row in valRows.iterrows():
# If ratio is missing for over 40% of the samples, drop
if (row.isna().sum()) > na_limit:
drop_list.append(index)
# drop introns with variance smaller than some minimal value
elif np.std(row) < ${min_variance}:
drop_list.append(index)
# save the intron information and sample values for remaining introns/rows
newtable = raw_df.drop(drop_list).iloc[:,0:4]
valRows = valRows.drop(drop_list)
# scale normalize on each row
valRows_matrix = []
for c in (valRows.values.tolist()):
c = preprocessing.scale(c)
valRows_matrix.append(c)
# qqnorms on the columns
matrix = np.array(valRows_matrix)
for i in range(len(matrix[0,:])):
matrix[:,i] = qqnorm(matrix[:,i])
normalized_table = pd.DataFrame(matrix)
# reset row index for the saved intron infomation so the index will match sample values
newtable = newtable.reset_index(drop=True)
# merge the two parts of table
output = pd.concat([newtable, normalized_table], axis=1)
output.columns = headers
# write normalized table, avoid scientific notation
output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None, float_format='%.16f')