snRNA-seq Preprocessing

snRNA-seq Preprocessing#

Description#

This pipeline preprocesses single-nuclei RNA-seq data in two stages. First it performs quality control with SCTK and Seurat: removing low-quality cells by mitochondrial content, total counts (nUMI) and detected genes (nFeature), filtering ambient RNA contamination with decontX, and removing doublets with a user-selected method (scds, scrublet, doubletFinder, or doubletCells). It then annotates cell types on the QC-filtered data using SingleR against a reference, transferring cluster-majority labels to the full Seurat object so the result is ready for pseudobulk aggregation.

Input Files#

File	Description
`input/snrnaseq/protocol_example.snrnaseq.cellranger/`	CellRanger output directory (contains the per-sample counts folder with `filtered_feature_bc_matrix`)
`input/snrnaseq/protocol_example.snrnaseq.id_mapping.csv`	Sample barcode-to-patient ID mapping (columns: `libraryBatch`, `cellBarcode`, `individualID`)
`input/snrnaseq/protocol_example.snrnaseq.seurat_ref_SE.rds`	SingleR reference SummarizedExperiment object (used in Step 2)

Steps#

Step 1. QC filtering (sctk_qc)#

Runs SCTK/Seurat QC on the CellRanger counts: it drops cells failing the mitochondrial-percent, minimum-count (nUMI) and minimum-gene (nFeature) thresholds, removes ambient RNA with decontX, and removes doublets with the chosen method (default scds). The sample-meta CSV maps cell barcodes to individual IDs. The result is a QC-filtered, clustered Seurat object.

sos run pipeline/snRNAseq_preprocessing.ipynb sctk_qc \
    --input-dir input/snrnaseq/protocol_example.snrnaseq.cellranger \
    --output-dir output/snrna_seq \
    --sample-meta input/snrnaseq/protocol_example.snrnaseq.id_mapping.csv

Step 2. Cell type annotation (cell_annotation)#

Annotates cell types on the QC-filtered object using SingleR against the reference SummarizedExperiment. Cells are downsampled per cluster (default 1000) for the SingleR call, then a cluster-majority label is assigned when at least cluster_threshold (default 0.90) of the cluster agrees, and transferred back to the full Seurat object.

sos run pipeline/snRNAseq_preprocessing.ipynb cell_annotation \
    --sctk-rds output/snrna_seq/SCTK_results/filtered_seuratobj.rds \
    --output-dir output/snrna_seq \
    --seurat-ref input/snrnaseq/protocol_example.snrnaseq.seurat_ref_SE.rds

Output Files#

File	Description
`output/snrna_seq/SCTK_results/filtered_seuratobj.rds`	QC-filtered, clustered Seurat object (from Step 1)
`output/snrna_seq/QC_table/SCTK_QC_table.csv`	Per-sample SCTK QC metrics
`output/snrna_seq/SCTK_results/QC_summary.csv`	Cell/gene counts at each QC step
`output/snrna_seq/annotated_seuratobj.rds`	Downsampled Seurat object with per-cell SingleR labels (from Step 2)
`output/snrna_seq/celltyped_seuratobj.rds`	Full Seurat object with cluster-majority cell type labels (from Step 2)

Anticipated Results#

Step 1 yields a QC-filtered Seurat object plus QC tables summarizing how many cells and genes survive each filtering step. Step 2 adds cell type labels, producing a celltype-labeled Seurat object ready for downstream pseudobulk aggregation.

[global]
parameter: cwd = path("output")
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "16G"
parameter: numThreads = 8
parameter: container = ""

import re
from sos.utils import expand_size

) if container else ""

cwd = path(f'{cwd:a}')

[sctk_qc]
parameter: input_dir          = str
parameter: output_dir         = str
parameter: sample_meta        = str
parameter: max_mito_percent   = 5
parameter: min_ncounts        = 250
parameter: min_ngenes         = 500
parameter: doublet_method     = 'scds'      # options: 'scds', 'scrublet', 'doubletFinder', 'doubletCells', 'none'
parameter: scds_call          = 'Singlet'
parameter: scrublet_threshold = 0.25
parameter: ambient_method     = 'decontX'   # options: 'decontX', 'none'
parameter: max_decontXsc      = 0.5

import os
input:  []
output: f'{output_dir}/SCTK_results/filtered_seuratobj.rds',
        f'{output_dir}/QC_table/SCTK_QC_table.csv'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'

    options(future.globals.maxSize = 8 * 1024^3)
    library(future)
    plan(multicore, workers = 16)

    library(Seurat)
    library(singleCellTK)
    library(tidyverse)
    library(SummarizedExperiment)
    library(scuttle)
    library(dplyr)
    library(SCOPfunctions)

    input_dir      <- "${input_dir}"
    qc_table_out   <- "${_output[1]}"
    rds_out        <- "${_output[0]}"
    doublet_method <- "${doublet_method}"
    ambient_method <- "${ambient_method}"

    # Validate method choices
    valid_doublet <- c("scds", "scrublet", "doubletFinder", "doubletCells", "none")
    valid_ambient <- c("decontX", "none")
    if (!doublet_method %in% valid_doublet)
      stop("Invalid doublet_method '", doublet_method, "'. Choose from: ", paste(valid_doublet, collapse = ", "))
    if (!ambient_method %in% valid_ambient)
      stop("Invalid ambient_method '", ambient_method, "'. Choose from: ", paste(valid_ambient, collapse = ", "))

    message("Doublet detection method : ", doublet_method)
    message("Ambient RNA removal method: ", ambient_method)

    # ── Create output directories ─────────────────────────────────
    dir.create(dirname(rds_out),      showWarnings = FALSE, recursive = TRUE)
    dir.create(dirname(qc_table_out), showWarnings = FALSE, recursive = TRUE)

    # ── QC summary table ──────────────────────────────────────────
    qc_summary <- data.frame(Step = character(),
                              Cells = numeric(), Genes = numeric(),
                              stringsAsFactors = FALSE)
    update_summary <- function(step, obj) {
      qc_summary <<- rbind(qc_summary, data.frame(
        Step  = step,
        Cells = ncol(obj), Genes = nrow(obj)))
    }

    # ── Sample metadata ───────────────────────────────────────────
    samplelist <- read.csv("${sample_meta}")
    samplelist$column_name <- paste0(samplelist$libraryBatch, "-counts_", samplelist$cellBarcode)
    samplelist <- samplelist[, c("individualID", "column_name")]
    colnames(samplelist) <- c("sample", "column_name")

    # ── Helper: convert SCE to Seurat ─────────────────────────────
    convert_to_seurat <- function(sce) {
      obj <- CreateSeuratObject(counts = counts(sce),
                                meta.data = as.data.frame(colData(sce)),
                                min.cells = ncol(sce) * 0.005)
      update_summary("Removing Genes expressed in < 0.5% of cells", obj)
      return(obj)
    }

    # ── Helper: assign sample IDs, convert back to SCE ────────────
    process_sce <- function(seurat_obj, samplelist) {
      seurat_obj <- seurat_obj[, colnames(seurat_obj) %in% samplelist$column_name]
      update_summary("Removing Cells with no associated Patient ID", seurat_obj)
      seurat_obj@meta.data$sample <- NULL
      seurat_obj@meta.data <- seurat_obj@meta.data %>%
        inner_join(samplelist, by = "column_name")
      seurat_obj@meta.data <- seurat_obj@meta.data[
        !duplicated(seurat_obj@meta.data$column_name), ]
      rownames(seurat_obj@meta.data) <- seurat_obj@meta.data$column_name
      return(as.SingleCellExperiment(seurat_obj))
    }

    # ── Helper: build QC algorithms list from chosen methods ──────
    # Doublet detection methods (runCellQC YAML names):
    #   "scds"          → cxds_bcds_hybrid  (cxds + bcds hybrid score)
    #   "scrublet"      → scrublet
    #   "doubletFinder" → doubletFinder
    #   "doubletCells"  → doubletCells      (scran-based)
    #   "none"          → skip doublet detection
    # Ambient RNA methods:
    #   "decontX"       → decontX
    #   "none"          → skip ambient RNA removal
    build_algorithms <- function(doublet_method, ambient_method) {
      algos <- "QCMetrics"
      if (doublet_method == "scds")          algos <- c(algos, "cxds_bcds_hybrid")
      if (doublet_method == "scrublet")      algos <- c(algos, "scrublet")
      if (doublet_method == "doubletFinder") algos <- c(algos, "doubletFinder")
      if (doublet_method == "doubletCells")  algos <- c(algos, "doubletCells")
      if (ambient_method == "decontX")       algos <- c(algos, "decontX")
      return(algos)
    }

    # ── Helper: build filter strings from chosen methods ──────────
    build_filters <- function(doublet_method, ambient_method) {
      filters <- c(
        "mito_percent < ${max_mito_percent}",
        "total > ${min_ncounts}",
        "detected > ${min_ngenes}"
      )
      if (doublet_method == "scds") {
        filters <- c(filters, "scds_hybrid_call=='${scds_call}'")
      } else if (doublet_method == "scrublet") {
        filters <- c(filters, "scrublet_score < ${scrublet_threshold}")
      } else if (doublet_method == "doubletFinder") {
        filters <- c(filters, "doubletFinder_doublet_label == 'Singlet'")
      } else if (doublet_method == "doubletCells") {
        filters <- c(filters, "scDblFinder_class == 'singlet'")
      }
      if (ambient_method == "decontX") {
        filters <- c(filters, "decontX_contamination < ${max_decontXsc}")
      }
      return(filters)
    }

    # ── Helper: run SCTK-QC, filter, cluster ─────────────────────
    process_sce_to_seurat <- function(sce) {
      algos <- build_algorithms(doublet_method, ambient_method)
      message("Running QC algorithms: ", paste(algos, collapse = ", "))

      sce <- runCellQC(sce, sample = NULL,
                       algorithms       = algos,
                       mitoRef          = "human",
                       mitoIDType       = "ensembl",
                       mitoGeneLocation = "rownames",
                       seed             = 12345)

      # ── Write per-sample QC table ─────────────────────────────
      sce_temp <- sampleSummaryStats(sce, sample = colData(sce)$sample, simple = FALSE)
      sst      <- getSampleSummaryStatsTable(sce_temp, statsName = "qc_table")
      write.csv(sst, qc_table_out)
      rm(sce_temp)

      # ── Apply filters ─────────────────────────────────────────
      qc_filters <- build_filters(doublet_method, ambient_method)
      message("Applying filters:")
      for (f in qc_filters) message("  ", f)
      sce <- subsetSCECols(sce, colData = qc_filters)
      update_summary("SCTK-QC Results", sce)

      # ── Convert to Seurat and cluster ─────────────────────────
      seurat_obj <- CreateSeuratObject(counts = counts(sce),
                                       meta.data = as.data.frame(colData(sce)),
                                       min.cells = ncol(sce) * 0.005)
      update_summary("Final Seurat Object", seurat_obj)

      seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
      seurat_obj <- FindVariableFeatures(seurat_obj)
      seurat_obj <- ScaleData(seurat_obj)
      seurat_obj <- RunPCA(seurat_obj)
      seurat_obj <- RunUMAP(seurat_obj, dims = 1:10, seed.use = 1)
      seurat_obj <- FindNeighbors(seurat_obj)
      seurat_obj <- FindClusters(seurat_obj)
      return(seurat_obj)
    }

    # ── 1. Import ─────────────────────────────────────────────────
    sce <- importCellRanger(cellRangerDirs = input_dir)
    update_summary("Raw Data", sce)

    # ── 2. Remove fake mapping genes ──────────────────────────────
    fake_genes <- grep("^(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|X|Y)_",
                       rownames(sce), value = TRUE)
    sce <- sce[setdiff(rownames(sce), fake_genes), ]
    update_summary("Removing FMGs", sce)

    # ── 3. Convert → filter → QC → cluster ───────────────────────
    seurat_obj       <- convert_to_seurat(sce);              rm(sce);        gc()
    processed_sce    <- process_sce(seurat_obj, samplelist); rm(seurat_obj); gc()
    processed_seurat <- process_sce_to_seurat(processed_sce)

    # ── 4. Save RDS ───────────────────────────────────────────────
    saveRDS(processed_seurat, rds_out)
    message("Saved: ", rds_out)
    rm(processed_seurat); gc()

    qc_csv <- file.path(dirname(rds_out), "QC_summary.csv")
    write.csv(qc_summary, qc_csv, row.names = FALSE)
    message("QC summary saved: ", qc_csv)

[cell_annotation]
parameter: sctk_rds          = str
parameter: output_dir        = str
parameter: seurat_ref        = str
parameter: downsample_n      = 1000
parameter: cluster_threshold = 0.90

import os
input:  []
output: f'{output_dir}/celltyped_seuratobj.rds'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'

    options(future.globals.maxSize = 16 * 1024^3)
    library(future)
    plan(multicore, workers = 16)
    library(Seurat)
    library(singleCellTK)
    library(tidyverse)
    library(SummarizedExperiment)
    library(scuttle)
    library(SingleR)
    library(org.Hs.eg.db)
    library(dplyr)

    sctk_file         <- "${sctk_rds}"
    output_dir        <- "${output_dir}"
    downsample_n      <- ${downsample_n}
    cluster_threshold <- ${cluster_threshold}

    anno_file <- file.path(output_dir, "annotated_seuratobj.rds")
    out_file  <- "${_output}"

    # ── Create output directory ───────────────────────────────────
    dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

    get_most_common_label <- function(cell_types, threshold = cluster_threshold) {
      label_table <- table(cell_types)
      most_common <- max(label_table)
      if ((most_common / sum(label_table)) >= threshold) {
        return(names(which.max(label_table)))
      } else {
        return(NA)
      }
    }

    # 1. Load and downsample
    seuratObj <- readRDS(sctk_file)
    seuratObj <- subset(x = seuratObj, downsample = downsample_n)

    # 2. Extract counts, strip Ensembl version suffixes
    sce_count <- GetAssayData(seuratObj[["RNA"]], layer = "counts")
    rownames(sce_count) <- gsub("\\..*$", "", rownames(sce_count))
    rm(seuratObj); gc()

    # 3. Ensembl to HGNC symbol conversion
    IDs <- select(org.Hs.eg.db,
                  keys    = rownames(sce_count),
                  columns = c("ENSEMBL", "SYMBOL"),
                  keytype = "ENSEMBL")
    colnames(IDs) <- c("ensembl_gene_id", "hgnc_symbol")
    IDs <- IDs[!is.na(IDs$hgnc_symbol) & IDs$hgnc_symbol != "", ]
    IDs <- IDs[!duplicated(IDs$hgnc_symbol), ]
    IDs <- IDs[!duplicated(IDs$ensembl_gene_id), ]
    sce_count <- sce_count[rownames(sce_count) %in% IDs$ensembl_gene_id, ]
    rownames(IDs) <- IDs$ensembl_gene_id
    rownames(sce_count) <- IDs[rownames(sce_count), "hgnc_symbol"]
    rm(IDs); gc()

    # 4. Load reference, intersect, free reference immediately after
    seurat_ref_SE <- readRDS("${seurat_ref}")
    common_genes  <- intersect(rownames(sce_count), rownames(seurat_ref_SE))
    ref_subset    <- seurat_ref_SE[common_genes, ]
    rm(seurat_ref_SE); gc()

    sce_count <- sce_count[common_genes, ]
    sce_SE    <- SummarizedExperiment(assays = list(counts = sce_count))
    sce_SE    <- logNormCounts(sce_SE)
    rm(sce_count); gc()

    # 5. SingleR annotation
    singleR_res <- SingleR(test   = sce_SE,
                           ref    = ref_subset,
                           labels = ref_subset$ref_label)
    rm(sce_SE, ref_subset); gc()

    # 6. Add labels to downsampled object
    seuratObj <- readRDS(sctk_file)
    seuratObj <- subset(x = seuratObj, downsample = downsample_n)
    anno_df   <- data.frame(celltype    = singleR_res$labels,
                            column_name = rownames(singleR_res),
                            stringsAsFactors = FALSE)
    seuratObj@meta.data <- seuratObj@meta.data %>%
      inner_join(anno_df, by = "column_name")
    rownames(seuratObj@meta.data) <- seuratObj@meta.data$column_name
    saveRDS(seuratObj, anno_file)
    message("Saved annotated (downsampled): ", anno_file)

    # 7. Transfer labels to full object
    full_obj <- readRDS(sctk_file)
    full_obj@meta.data$cell <- rownames(full_obj@meta.data)

    singleR_data <- data.frame(
      celltype        = seuratObj@meta.data$celltype,
      seurat_clusters = seuratObj@meta.data$seurat_clusters
    )
    rm(seuratObj); gc()

    most_common_labels <- singleR_data %>%
      group_by(seurat_clusters) %>%
      summarise(most_common_label = get_most_common_label(celltype), .groups = "drop")

    merged_meta  <- merge(full_obj@meta.data, most_common_labels,
                          by = "seurat_clusters", all.x = TRUE)
    ordered_meta <- merged_meta[match(colnames(full_obj@assays$RNA), merged_meta$cell), ]

    full_obj@meta.data           <- ordered_meta
    rownames(full_obj@meta.data) <- full_obj@meta.data$cell
    full_obj@meta.data$celltype  <- full_obj@meta.data$most_common_label
    Idents(full_obj)             <- full_obj@meta.data$celltype

    saveRDS(full_obj, out_file)
    message("Saved celltyped: ", out_file)
    rm(full_obj, singleR_res, anno_df, singleR_data,
       most_common_labels, merged_meta, ordered_meta); gc()