Splicing QC and Normalization#

Description#

Quality control and normalization of alternative-splicing quantifications produced by the leafcutter and psichomics tools. Raw outputs are converted to BED format, features with high per-sample missingness (default 40%) are dropped, low-variance introns/events (default min variance 0.001) are removed, values are quantile-normalized, and remaining missing values are mean-imputed. The normalized tables are ready to use as molecular phenotypes for tensorQTL.

Timing: ~2-5 min on typical compute infrastructure.

Input Files#

File

Description

protocol_example.leafcutter.intron_usage_perind.counts.gz

leafcutter per-cluster intron usage counts (count/total per sample); input to leafcutter_norm.

protocol_example.leafcutter.intron_usage_perind.counts.gz_raw_data.txt

A numeric, QC-ready leafcutter intron-ratio BED table; input to the standalone leafcutter_qqnorm.

protocol_example.psichomics.psi_raw_data.tsv

psichomics PSI matrix keyed by splicing-event IDs; input to psichomics_norm.

Output Files#

File

Description

*_raw_data.qqnorm.txt

Quantile-normalized leafcutter intron-ratio phenotype table.

psichomics_raw_data_bedded.txt

psichomics PSI matrix parsed into BED format.

psichomics_raw_data_bedded.qqnorm.txt

Quantile-normalized psichomics PSI phenotype table.

Each workflow produces a quantile-normalized phenotype table in BED-like format (#Chr start end ID plus one column per sample), ready to use directly as a molecular phenotype input to TensorQTL.

Steps#

0. Generate sample list#

RNA-seq sample IDs differ from the WGS sample IDs, so the joint-call sample lookup table is first subset to the RNA-seq samples used in the sQTL analysis, matching the sample set used elsewhere in the protocol.

Step input: ROSMAP_JointCall_sample_participant_lookup_fixed — a sample comparison table for ROSMAP datasets.

Step output: ROSMAP_JointCall_sample_participant_lookup_fixed.rnaseq — the same lookup table restricted to the RNA-seq sample set.

leafcutter_norm: QC, impute, and normalize leafcutter intron ratios#

This workflow runs the full leafcutter chain in one command: it builds the phenotype table, applies an autosome filter and QC, and quantile-normalizes the intron-usage ratios. The --mean_impute option fills missing ratios with the per-intron (row) mean. See the leafcutter documentation; the choice of regtool parameters is discussed here.

Parameters. chr_blacklist is a file of blacklisted chromosomes to exclude from analysis, one per line; if none is provided, no chromosomes are excluded.

Note. leafcutter_norm_1 typically requires about 10 GB of memory (more for large inputs); insufficient memory can cause a segmentation fault.

sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd output/splicing \
    --ratios input/rnaseq/protocol_example.leafcutter.intron_usage_perind.counts.gz \
    --mean_impute 

leafcutter_qqnorm: quantile-normalize an already-QCed leafcutter table#

Use this standalone workflow when the intron-ratio table has already been QCed and imputed and only quantile normalization is needed.

sos run pipeline/splicing_normalization.ipynb leafcutter_qqnorm \
    --cwd output/splicing \
    --qced-data input/rnaseq/protocol_example.leafcutter.intron_usage_perind.counts.gz_raw_data.txt 

psichomics_norm: parse, QC, and normalize psichomics PSI values#

This workflow parses psichomics splicing-event IDs into BED coordinates, applies NA and variance filters, and then quantile-normalizes each event. See the psichomics documentation.

QC. The only QC applied to PSI values here is NA removal and a minimal variance filter; the psichomics maintainers suggest some further QC, discussed here. For reference, the default minimal variance used in leafcutter QC is 0.001.

Because the alternative-splicing event types in psichomics outputs differ, normalization is performed per event (rows) rather than per sample (columns).

sos run pipeline/splicing_normalization.ipynb psichomics_norm \
    --cwd output/splicing \
    --ratios input/rnaseq/protocol_example.psichomics.psi_raw_data.tsv 

Complete worked example#

To perform leafcutter analysis with QC (setting cluster counts of 0 as NA), imputation (flashR), and normalization, use the following commands (RECOMMENDED):

Step 1: QC#

Input: intron usage count matrix from LeafCutter clustering (*_intron_usage_perind.counts.gz), produced upstream by the splicing calling step.

Output: a QC’d count matrix in which cluster counts equal to 0 are set to NA; normalization is skipped at this stage via --no_norm.

sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd ../../output_test/leafcutter/normalize \
    --ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
    --container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
    --no_norm # add no norm to skip last step (qqnorm) in leafcutter_norm

Step 2: Imputation#

Input: the QC’d count matrix from Step 1, supplied through --phenoFile.

Output: an imputed phenotype matrix with missing values filled by the EBMF (flashR) model.

sos run pipeline/phenotype_imputation.ipynb EBMF \
    --phenoFile <output_from_step1> \
    --cwd ../../output_test/leafcutter/imputation \
    --prior ebnm_point_laplace --varType 1 \
    --container oras://ghcr.io/cumc/factor_analysis_apptainer:latest \
    --mem 40G --numThreads 20 --walltime 100h 

Step 3: Normalization#

Input: the imputed phenotype matrix from Step 2, supplied through --qced-data.

Output: a quantile-normalized, BED-ready table (*.qqnorm.txt) ready for downstream QTL analysis.

sos run pipeline/splicing_normalization.ipynb leafcutter_qqnorm \
    --qced-data <output_from_step2> \
    --container oras://ghcr.io/cumc/leafcutter_apptainer:latest

Or if you are going to use the default mean imputation method, you can run it with one-step command (remove –no_norm in the first step):

Input: the raw intron usage count matrix (*_intron_usage_perind.counts.gz).

Output: a quantile-normalized table (*.qqnorm.txt) produced in a single pass; missing values are filled by simple mean imputation (--mean_impute).

sos run pipeline/splicing_normalization.ipynb leafcutter_norm \
    --cwd ../../output_test/leafcutter/normalize \
    --ratios ../../output_test/leafcutter/PCC_sample_list_subset_leafcutter_intron_usage_perind.counts.gz \
    --container oras://ghcr.io/cumc/leafcutter_apptainer:latest \
    --mean_impute

Command interface#

sos run splicing_normalization.ipynb -h

Workflow implementation#

[global]
# The output directory for generated files. 
parameter: cwd = path("output")
# intron usage ratio file wiht samples after QC
# optional parameter black list if user want to blacklist some chromosomes and not to analyze
parameter: chr_blacklist = path(".")
# For cluster jobs, number commands to run per job
parameter: job_size = 1
# Wall clock time expected
parameter: walltime = "5h"
# Memory expected
parameter: mem = "16G"
# Number of threads
parameter: numThreads = 8
# Software container option
parameter: container = ""
from sos.utils import expand_size
cwd = path(f'{cwd:a}')
[Junc_list]
parameter: junc_path = path
parameter: file_suffix = "junc"
parameter: leafcutter_version = "leafcutter2"
parameter: dataset = "ROSMAP_DLPFC"
output: f'{cwd}/{leafcutter_version}/{dataset}_intron_usage_perind.junc'
bash: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stderr' 
    cd ${cwd}
    realpath  ${junc_path}/*${file_suffix} > ${_output}.tmp
    grep -v "all" ${_output}.tmp > ${_output}
    rm ${_output}.tmp
[Jointcall_samples]
parameter: sample_table = path
parameter: junc_list = path
parameter: leafcutter_version = "leafcutter2"
input: sample_table, junc_list
output: f'{cwd}/{leafcutter_version}/{sample_table:b}.rnaseq'
R: container=container, expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stderr' 
    library(data.table)
    library(tidyverse)
    sample_lookup_hao = read_delim(${_input[0]:r}, "\t" ,col_names = T)
    sample_lookup_hao$sample_id<-sample_lookup_hao$sample_id%>%gsub(".final","",.)
    sample_ids<-list.files(paste0("${cwd}","/","${leafcutter_version}","/"),"junc$")%>%stringr::str_split(.,"[.]",simplify=T)%>%.[,1]
    junc_list<-read.table(${_input[1]:r})
    sample_ids<-junc_list$V1%>%basename%>%stringr::str_split(.,"[.]",simplify=T)%>%.[,1]
    sample_lookup_hao<-sample_lookup_hao[sample_lookup_hao$sample_id%in%sample_ids,]
    write.table(sample_lookup_hao,${_output:r},quote = F,row.names = F,sep='\t')
[leafcutter_norm_1]
parameter: ratios = path
parameter: pseudo_ratio = False 
import os
if os.path.isfile(f'{ratios:dd}/black_list.txt'):
    chr_blacklist = f'{ratios:dd}/black_list.txt'
input: ratios, group_by = 'all'
output: f'{ratios}_phenotype_file_list.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    # code in [leafcutter_norm_1] and [leafcutter_norm_3] is modified from 
    # https://github.com/davidaknowles/leafcutter/blob/master/scripts/prepare_phenotype_table.py
    import sys
    import gzip
    import os
    import numpy as np
    import pandas as pd
    import scipy as sc
    import pickle

    from sklearn import linear_model

    def stream_table(f, ss = ''):
        fc = '#'
        while fc[0] == "#":
            fc = f.readline().strip()
            head = fc.split(ss)

        for ln in f:
            ln = ln.strip().split(ss)
            attr = {}

            for i in range(len(head)):
                try: attr[head[i]] = ln[i]
                except: break
            yield attr

    def get_chromosomes(ratio_file):
        """Get chromosomes from table. Returns set of chromosome names"""
        try: open(ratio_file)
        except:
            sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
            return
        sys.stderr.write("Parsing chromosome names...\n")
        chromosomes = set()
        with gzip.open(ratio_file, 'rt') as f:
                f.readline()
                for line in f:
                    chromosomes.add(line.split(":")[0])
        return(chromosomes)

    def get_blacklist_chromosomes(chromosome_blacklist_file):
        """
        Get list of chromosomes to ignore from a file with one blacklisted
        chromosome per line. Returns list. eg. ['X', 'Y', 'MT']
        """

        if os.path.isfile(chromosome_blacklist_file):
            with open(chromosome_blacklist_file, 'r') as f:
                return(f.read().splitlines())
        else:
            return([])

    def create_phenotype_table(ratio_file, chroms, blacklist_chroms):
        dic_pop, fout = {}, {}
        try: open(ratio_file)
        except:
            sys.stderr.write("Can't find %s..exiting\n"%(ratio_file))
            return

        sys.stderr.write("Starting...\n")
        for i in chroms:
            fout[i] = open(ratio_file+".phen_"+i, 'w')
            fout_ave = open(ratio_file+".ave", 'w')
        valRows, valRowsnn, geneRows = [], [], []
        finished = False
        header = gzip.open(ratio_file, 'rt').readline().split()[1:]

        for i in fout:
            fout[i].write("\t".join(["#chr","start", "end", "ID"]+header)+'\n')

        for dic in stream_table(gzip.open(ratio_file, 'rt'),' '):

            chrom = dic['chrom']
            chr_ = chrom.split(":")[0]
            if chr_ in blacklist_chroms: continue
            NA_indices, aveReads = [], []
            tmpvalRow = []

            i = 0
            for sample in header:

                try: count = dic[sample]
                except: print([chrom, len(dic)])
                num, denom = count.split('/')
                if float(denom) < 1:
                    count = "NA"
                    tmpvalRow.append("NA")
                    NA_indices.append(i)
                else:
                    # add a 0.5 pseudocount
                    if ${pseudo_ratio}:
                        pseudocount = 0.01 * float(denom)
                    else:
                        pseudocount = 0.5
                    count = (float(num)+pseudocount)/((float(denom))+pseudocount)
                    tmpvalRow.append(count)
                    aveReads.append(count)

            parts = chrom.split(":")

            # Check the number of parts to determine how to unpack
            if len(parts) == 5:
                chr_, s, e, clu, category = parts #leafcutter2 would have 5 columns, the last column is function category
            elif len(parts) == 4:
                chr_, s, e, clu = parts #leafcutter would have 4 columns
                category = None  # or some default value, if necessary
            else:
                raise ValueError("Unexpected format of 'chrom' string")

            if len(tmpvalRow) > 0:
                fout[chr_].write("\t".join([chr_,s,e,chrom]+[str(x) for x in tmpvalRow])+'\n')
                fout_ave.write(" ".join(["%s"%chrom]+[str(min(aveReads)), str(max(aveReads)), str(np.mean(aveReads))])+'\n')

                valRows.append(tmpvalRow)
                geneRows.append("\t".join([chr_,s,e,chrom]))
                if len(geneRows) % 1000 == 0:
                    sys.stderr.write("Parsed %s introns...\n"%len(geneRows))

        for i in fout:
            fout[i].close()

        matrix = np.array(valRows)

        # write the corrected tables

        sample_names = []

        for name in header:
            sample_names.append(name.replace('.Aligned.sortedByCoord.out.md', ''))

        fout = {}
        for i in chroms:
            fn="%s.qqnorm_%s"%(ratio_file,i)
            print("Outputting: " + fn)
            fout[i] = open(fn, 'w')
            fout[i].write("\t".join(['#Chr','start','end','ID'] + sample_names)+'\n')
        lst = []
        for i in range(len(matrix)):
            chrom, s = geneRows[i].split()[:2]

            lst.append((chrom, int(s), "\t".join([geneRows[i]] + [str(x) for x in  matrix[i]])+'\n'))

        lst.sort()
        for ln in lst:
            fout[ln[0]].write(ln[2])

        fout_run = open("%s_phenotype_file_list.txt"%ratio_file, 'w')

        fout_run.write("#chr\t#dir\n")

        for i in fout:
            fout[i].close()
            fout_run.write("%s\t"%(i))
            fout_run.write("%s.qqnorm_%s\n"%(ratio_file, i))
        fout_run.close()

    ratio_file = f'${_input}'
    chroms = get_chromosomes(f'${_input}')
    blacklist_chroms = get_blacklist_chromosomes(f'${chr_blacklist}')

    create_phenotype_table(ratio_file, chroms, blacklist_chroms)
[leafcutter_norm_2]
parameter: autosomes = True
import pandas as pd
molecular_pheno_chr_inv = pd.read_csv(f'{_input[0]}',sep = "\t")
# filter the autosomes or not
if autosomes:
    print("Analyzing autosomes 1 to 22...")
    mask = molecular_pheno_chr_inv['#chr'].str.match(r'chr(\d+)$')
    chr_numbers = molecular_pheno_chr_inv['#chr'].str.extract(r'chr(\d+)', expand=False).dropna().astype(int)
    molecular_pheno_chr_inv = molecular_pheno_chr_inv[mask & chr_numbers.between(1, 22)]
    
molecular_pheno_chr_inv = molecular_pheno_chr_inv.values.tolist()
file_inv = [x[1] for x in molecular_pheno_chr_inv]
input: file_inv # This design is necessary to avoid using for_each, as sos can not take chr number as an input.
output: f'{_input[0]:n}_raw_data.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    head -1 ${_input[0]:r}  > ${_output}
    cat ${_input:r} | grep -v "#Chr" >> ${_output}
[leafcutter_norm_3,leafcutter_qqnorm]
parameter: no_norm = False
stop_if(no_norm)
parameter: mean_impute = False
# minimal NA rate with in sample values for a possible alternative splicing event to be kept (default 0.4, chosen according to leafcutter default na rate)
parameter: na_rate = 0.4
# minimal variance across samples for a possible alternative splicing event to be kept (default 0.001, chosen according to psichomics suggested minimal variance)
parameter: min_variance = 0.001
# qced_data refers to data that has undergone leafcutter_norm1/2 normalization or has already been imputed. 
# When this data is provided, mean imputation is automatically enabled (mean imputation set to true). Thus, the default imputation method is mean imputation. 
# However, if the data provided is already imputed (i.e., contains no NA values), no further imputation will be applied.
parameter: qced_data = path(".")
parameter: bgzip=False
import os
if qced_data.is_dir():
    print('Target Qced file with suffix "_raw_data.txt"')
    input_file = paths([os.path.join(qced_data, x) for x in os.listdir(qced_data) if x.endswith('_raw_data.txt')])
    bgzip=True
else:
    print('Loading QCed data as inputed file')
    input_file = paths(qced_data)
    print('Setting mean_imput to True. If the input has already been imputed (i.e., no NA values are present), nothing will change. Otherwise, mean imputation will be applied to NA values.')
    mean_impute = True
input: input_file
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing

    from scipy.stats import norm
    from scipy.stats import rankdata

    def qqnorm(x):
        n=len(x)
        a=3.0/8.0 if n<=10 else 0.5
        return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
        #return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))

    raw_df = pd.read_csv(f'${_input}',sep = "\t")
    valRows = raw_df.iloc[:,4:]
    headers = list(raw_df)

    drop_list = []
    na_limit = len(valRows.columns)*${na_rate}

    for index, row in valRows.iterrows():

        # If ratio is missing for over 40% of the samples, drop
        if (row.isna().sum()) > na_limit:
            drop_list.append(index)
        # Set missing values as the mean of existed values in a row
        if ${mean_impute}:
            valRows.iloc[index] = row.fillna(row.mean())
        # else: 
        #     row.fillna(row.mean())
        # drop introns with variance smaller than some minimal value
        if np.nanstd(row) < ${min_variance}:
            drop_list.append(index)

    # save the intron information and sample values for remaining introns/rows
    newtable = raw_df.drop(drop_list).iloc[:,0:4]
    valRows = valRows.drop(drop_list)

    # scale normalize on each row
    valRows_matrix = []
    for c in (valRows.values.tolist()):
        c = preprocessing.scale(c)
        valRows_matrix.append(c)
    
    # qqnorms on the columns
    matrix = np.array(valRows_matrix)
    for i in range(len(matrix[0,:])):
        matrix[:,i] = qqnorm(matrix[:,i])
    normalized_table = pd.DataFrame(matrix)

    # reset row index for the saved intron infomation so the index will match sample values
    newtable = newtable.reset_index(drop=True)
    # merge the two parts of table
    output = pd.concat([newtable, normalized_table], axis=1)
    output.columns = headers

    # write normalized table
    output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None)

bash: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    if ${'true' if bgzip else 'false'}; then
        # Compress the sorted BED file
        bgzip -c "${_output}" > "${_output:n}.bed.gz"
        # Index the compressed BED file
        tabix -p bed "${_output:n}.bed.gz"
    fi
[psichomics_norm_1]
parameter: ratios = path
input: ratios, group_by = 'all'
output: f'{cwd}/psichomics_raw_data_bedded.txt' 
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
R: expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout', container=container
    library(psichomics)
    library(data.table)

    psi_data <- as.matrix(fread("${_input}"),rownames=1)
    psi_data = as.data.frame(psi_data)
  
    # Process PSI df into bed file for tensorQTL. (This part of code is modified from Ryan Yordanoff's work)
    parsed_events <- parseSplicingEvent(row.names(psi_data))
    
    # Create bedfile df and fill values with parsed values
    bed_file <- data.frame("chr"=parsed_events$chrom,"start"=parsed_events$start,"end"=parsed_events$end,"ID"=row.names(parsed_events),psi_data,check.names = FALSE)
    names(bed_file)[1] <- "#Chr"
    bed_file$'#Chr' <- sub("^", "chr", bed_file$'#Chr')
    row.names(bed_file) <- NULL   

    bed_file <- bed_file[order(bed_file$'#Chr',bed_file$start,bed_file$end),]
  
    # Create BED file output
    write.table(x=bed_file, file = "${cwd}/psichomics_raw_data_bedded.txt", quote = FALSE, row.names = FALSE, sep = "\t")
[psichomics_norm_2]
# minimal NA rate with in sample values for a possible alternative splicing event to be kept (default 0.4, chosen according to leafcutter default na rate)
parameter: na_rate = 0.4
# minimal variance across samples for a possible alternative splicing event to be kept (default 0.001, chosen according to psichomics suggested minimal variance)
parameter: min_variance = 0.001
output: f'{_input:n}.qqnorm.txt'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
python: expand= "${ }", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing

    from scipy.stats import norm
    from scipy.stats import rankdata

    def qqnorm(x):
        n=len(x)
        a=3.0/8.0 if n<=10 else 0.5
        return(norm.ppf( (rankdata(x,nan_policy="omit")-a)/(n+1.0-2.0*a) ))
        #return(norm.ppf( (rankdata(x)-a)/(n+1.0-2.0*a) ))

    raw_df = pd.read_csv(f'${_input}',sep = "\t")
    valRows = raw_df.iloc[:,4:]
    headers = list(raw_df)

    drop_list = []
    na_limit = len(valRows.columns)*${na_rate}

    for index, row in valRows.iterrows():

        # If ratio is missing for over 40% of the samples, drop
        if (row.isna().sum()) > na_limit:
            drop_list.append(index)
            # drop introns with variance smaller than some minimal value
        elif np.std(row) < ${min_variance}:
            drop_list.append(index)

    # save the intron information and sample values for remaining introns/rows
    newtable = raw_df.drop(drop_list).iloc[:,0:4]
    valRows = valRows.drop(drop_list)

    # scale normalize on each row
    valRows_matrix = []
    for c in (valRows.values.tolist()):
        c = preprocessing.scale(c)
        valRows_matrix.append(c)

    # qqnorms on the columns
    matrix = np.array(valRows_matrix)
    for i in range(len(matrix[0,:])):
        matrix[:,i] = qqnorm(matrix[:,i])
    normalized_table = pd.DataFrame(matrix)

    # reset row index for the saved intron infomation so the index will match sample values
    newtable = newtable.reset_index(drop=True)
    # merge the two parts of table
    output = pd.concat([newtable, normalized_table], axis=1)
    output.columns = headers

    # write normalized table, avoid scientific notation
    output.to_csv(f'${_input:n}.qqnorm.txt', sep="\t", index=None, float_format='%.16f')