snRNA-seq Preprocessing QC Pipeline#

Overview#

This pipeline performs quality control on single-nuclei RNA-seq data using SCTK-QC and Seurat.

Goals:

  • Remove low-quality cells using mitochondrial content, count depth (nUMI), and gene detection thresholds (nFeature)

  • Filter ambient RNA contamination (decontX)

  • Remove doublets using a user-selected method (scds, scrublet, doubletFinder, doubletCells)

  • Generate a QC-filtered, clustered Seurat object

Input Files#

File

Description

CellRanger output directory

Path to a single CellRanger output directory

ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv

Sample barcode-to-patient ID mapping (columns: libraryBatch, cellBarcode, individualID)

Parameters#

Parameter

Default

Description

input_dir

required

Path to CellRanger output directory

output_dir

required

Output directory

sample_meta

required

Path to demultiplexed sample ID mapping CSV

max_mito_percent

5

Maximum mitochondrial read percentage

min_ncounts

250

Minimum total counts per cell (nUMI)

min_ngenes

500

Minimum detected genes per cell (nFeature)

doublet_method

scds

Doublet detection method: scds, scrublet, doubletFinder, doubletCells, or none

scds_call

Singlet

Required SCDS call label (used when doublet_method = 'scds')

scrublet_threshold

0.25

Scrublet score cutoff (used when doublet_method = 'scrublet')

ambient_method

decontX

Ambient RNA removal method: decontX or none

max_decontXsc

0.5

Maximum decontX contamination score (used when ambient_method = 'decontX')

Output Files#

File

Description

SCTK_results/filtered_seuratobj.rds

QC-filtered, clustered Seurat object

QC_table/SCTK_QC_table.csv

Per-sample SCTK QC metrics

SCTK_results/QC_summary.csv

Cell/gene counts at each QC step

Timing: < 10 min per batch

Minimal Working Example#

sos run pipeline/snRNAseq_preprocessing.ipynb sctk_qc \
    --input-dir data/snrna_seq/grouptest \
    --output-dir output/snrna_seq \
    --sample-meta data/snrna_seq/ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv 

Cell Type Annotation#

Overview#

This pipeline annotates cell types in QC-filtered single-nuclei RNA-seq data using SingleR and transfers cluster-level labels to full Seurat objects.

Goals:

  • Annotate cell types on downsampled data using a curated reference

  • Transfer cluster-majority labels to full Seurat objects

  • Generate celltype-labeled Seurat objects ready for pseudobulk aggregation

Input Files#

File

Description

filtered_seuratobj.rds

QC-filtered Seurat object from sctk_qc

seurat_ref_SE.rds

SingleR reference SummarizedExperiment object

Parameters#

Parameter

Default

Description

sctk_rds

required

Path to filtered_seuratobj.rds from sctk_qc

output_dir

required

Output directory

seurat_ref

required

Path to seurat_ref_SE.rds reference object

downsample_n

1000

Cells per cluster to downsample before SingleR

cluster_threshold

0.90

Minimum fraction of cells required to assign a cluster-level label

Output Files#

File

Description

annotated_seuratobj.rds

Downsampled Seurat object with per-cell SingleR labels

celltyped_seuratobj.rds

Full Seurat object with cluster-majority cell type labels

Timing: >30 min per seurat object

Minimal Working Example#

sos run pipeline/snRNAseq_preprocessing.ipynb cell_annotation \
    --sctk-rds output/snrna_seq/SCTK_results/filtered_seuratobj.rds \
    --output-dir output/cell_annotation \
    --seurat-ref data/snrna_seq/seurat_ref_SE.rds

Command interface#

sos run pipeline/snRNAseq_preprocessing.ipynb -h
[workflow_name | -t targets] [options] [workflow_options]
 workflow_name:        Single or combined workflows defined in this script
 targets:              One or more targets to generate
 options:              Single-hyphen sos parameters (see "sos run -h" for details)
 workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
 sctk_qc
 cell_annotation

Global Workflow Options:
 --cwd output (as path)
 --job-size 1 (as int)
 --walltime 5h
 --mem 16G
 --numThreads 8 (as int)
 --container ''

Sections
 sctk_qc:
   Workflow Options:
     --input-dir VAL (as str, required)
     --output-dir VAL (as str, required)
     --sample-meta VAL (as str, required)
     --max-mito-percent 5 (as int)
     --min-ncounts 250 (as int)
     --min-ngenes 500 (as int)
     --doublet-method scds
     --scds-call Singlet
     --scrublet-threshold 0.25 (as float)
     --ambient-method decontX
     --max-decontXsc 0.5 (as float)
 cell_annotation:
   Workflow Options:
     --sctk-rds VAL (as str, required)
     --output-dir VAL (as str, required)
     --seurat-ref VAL (as str, required)
     --downsample-n 1000 (as int)
     --cluster-threshold 0.9 (as float)

Setup and global parameters#

[global]
parameter: cwd = path("output")
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "16G"
parameter: numThreads = 8
parameter: container = ""

import re
from sos.utils import expand_size

) if container else ""

cwd = path(f'{cwd:a}')

sctk_qc#

[sctk_qc]
parameter: input_dir          = str
parameter: output_dir         = str
parameter: sample_meta        = str
parameter: max_mito_percent   = 5
parameter: min_ncounts        = 250
parameter: min_ngenes         = 500
parameter: doublet_method     = 'scds'      # options: 'scds', 'scrublet', 'doubletFinder', 'doubletCells', 'none'
parameter: scds_call          = 'Singlet'
parameter: scrublet_threshold = 0.25
parameter: ambient_method     = 'decontX'   # options: 'decontX', 'none'
parameter: max_decontXsc      = 0.5

import os
input:  []
output: f'{output_dir}/SCTK_results/filtered_seuratobj.rds',
        f'{output_dir}/QC_table/SCTK_QC_table.csv'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'

    options(future.globals.maxSize = 8 * 1024^3)
    library(future)
    plan(multicore, workers = 16)

    library(Seurat)
    library(singleCellTK)
    library(tidyverse)
    library(SummarizedExperiment)
    library(scuttle)
    library(dplyr)
    library(SCOPfunctions)

    input_dir      <- "${input_dir}"
    qc_table_out   <- "${_output[1]}"
    rds_out        <- "${_output[0]}"
    doublet_method <- "${doublet_method}"
    ambient_method <- "${ambient_method}"

    # Validate method choices
    valid_doublet <- c("scds", "scrublet", "doubletFinder", "doubletCells", "none")
    valid_ambient <- c("decontX", "none")
    if (!doublet_method %in% valid_doublet)
      stop("Invalid doublet_method '", doublet_method, "'. Choose from: ", paste(valid_doublet, collapse = ", "))
    if (!ambient_method %in% valid_ambient)
      stop("Invalid ambient_method '", ambient_method, "'. Choose from: ", paste(valid_ambient, collapse = ", "))

    message("Doublet detection method : ", doublet_method)
    message("Ambient RNA removal method: ", ambient_method)

    # ── Create output directories ─────────────────────────────────
    dir.create(dirname(rds_out),      showWarnings = FALSE, recursive = TRUE)
    dir.create(dirname(qc_table_out), showWarnings = FALSE, recursive = TRUE)

    # ── QC summary table ──────────────────────────────────────────
    qc_summary <- data.frame(Step = character(),
                              Cells = numeric(), Genes = numeric(),
                              stringsAsFactors = FALSE)
    update_summary <- function(step, obj) {
      qc_summary <<- rbind(qc_summary, data.frame(
        Step  = step,
        Cells = ncol(obj), Genes = nrow(obj)))
    }

    # ── Sample metadata ───────────────────────────────────────────
    samplelist <- read.csv("${sample_meta}")
    samplelist$column_name <- paste0(samplelist$libraryBatch, "-counts_", samplelist$cellBarcode)
    samplelist <- samplelist[, c("individualID", "column_name")]
    colnames(samplelist) <- c("sample", "column_name")

    # ── Helper: convert SCE to Seurat ─────────────────────────────
    convert_to_seurat <- function(sce) {
      obj <- CreateSeuratObject(counts = counts(sce),
                                meta.data = as.data.frame(colData(sce)),
                                min.cells = ncol(sce) * 0.005)
      update_summary("Removing Genes expressed in < 0.5% of cells", obj)
      return(obj)
    }

    # ── Helper: assign sample IDs, convert back to SCE ────────────
    process_sce <- function(seurat_obj, samplelist) {
      seurat_obj <- seurat_obj[, colnames(seurat_obj) %in% samplelist$column_name]
      update_summary("Removing Cells with no associated Patient ID", seurat_obj)
      seurat_obj@meta.data$sample <- NULL
      seurat_obj@meta.data <- seurat_obj@meta.data %>%
        inner_join(samplelist, by = "column_name")
      seurat_obj@meta.data <- seurat_obj@meta.data[
        !duplicated(seurat_obj@meta.data$column_name), ]
      rownames(seurat_obj@meta.data) <- seurat_obj@meta.data$column_name
      return(as.SingleCellExperiment(seurat_obj))
    }

    # ── Helper: build QC algorithms list from chosen methods ──────
    # Doublet detection methods (runCellQC YAML names):
    #   "scds"          → cxds_bcds_hybrid  (cxds + bcds hybrid score)
    #   "scrublet"      → scrublet
    #   "doubletFinder" → doubletFinder
    #   "doubletCells"  → doubletCells      (scran-based)
    #   "none"          → skip doublet detection
    # Ambient RNA methods:
    #   "decontX"       → decontX
    #   "none"          → skip ambient RNA removal
    build_algorithms <- function(doublet_method, ambient_method) {
      algos <- "QCMetrics"
      if (doublet_method == "scds")          algos <- c(algos, "cxds_bcds_hybrid")
      if (doublet_method == "scrublet")      algos <- c(algos, "scrublet")
      if (doublet_method == "doubletFinder") algos <- c(algos, "doubletFinder")
      if (doublet_method == "doubletCells")  algos <- c(algos, "doubletCells")
      if (ambient_method == "decontX")       algos <- c(algos, "decontX")
      return(algos)
    }

    # ── Helper: build filter strings from chosen methods ──────────
    build_filters <- function(doublet_method, ambient_method) {
      filters <- c(
        "mito_percent < ${max_mito_percent}",
        "total > ${min_ncounts}",
        "detected > ${min_ngenes}"
      )
      if (doublet_method == "scds") {
        filters <- c(filters, "scds_hybrid_call=='${scds_call}'")
      } else if (doublet_method == "scrublet") {
        filters <- c(filters, "scrublet_score < ${scrublet_threshold}")
      } else if (doublet_method == "doubletFinder") {
        filters <- c(filters, "doubletFinder_doublet_label == 'Singlet'")
      } else if (doublet_method == "doubletCells") {
        filters <- c(filters, "scDblFinder_class == 'singlet'")
      }
      if (ambient_method == "decontX") {
        filters <- c(filters, "decontX_contamination < ${max_decontXsc}")
      }
      return(filters)
    }

    # ── Helper: run SCTK-QC, filter, cluster ─────────────────────
    process_sce_to_seurat <- function(sce) {
      algos <- build_algorithms(doublet_method, ambient_method)
      message("Running QC algorithms: ", paste(algos, collapse = ", "))

      sce <- runCellQC(sce, sample = NULL,
                       algorithms       = algos,
                       mitoRef          = "human",
                       mitoIDType       = "ensembl",
                       mitoGeneLocation = "rownames",
                       seed             = 12345)

      # ── Write per-sample QC table ─────────────────────────────
      sce_temp <- sampleSummaryStats(sce, sample = colData(sce)$sample, simple = FALSE)
      sst      <- getSampleSummaryStatsTable(sce_temp, statsName = "qc_table")
      write.csv(sst, qc_table_out)
      rm(sce_temp)

      # ── Apply filters ─────────────────────────────────────────
      qc_filters <- build_filters(doublet_method, ambient_method)
      message("Applying filters:")
      for (f in qc_filters) message("  ", f)
      sce <- subsetSCECols(sce, colData = qc_filters)
      update_summary("SCTK-QC Results", sce)

      # ── Convert to Seurat and cluster ─────────────────────────
      seurat_obj <- CreateSeuratObject(counts = counts(sce),
                                       meta.data = as.data.frame(colData(sce)),
                                       min.cells = ncol(sce) * 0.005)
      update_summary("Final Seurat Object", seurat_obj)

      seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
      seurat_obj <- FindVariableFeatures(seurat_obj)
      seurat_obj <- ScaleData(seurat_obj)
      seurat_obj <- RunPCA(seurat_obj)
      seurat_obj <- RunUMAP(seurat_obj, dims = 1:10, seed.use = 1)
      seurat_obj <- FindNeighbors(seurat_obj)
      seurat_obj <- FindClusters(seurat_obj)
      return(seurat_obj)
    }

    # ── 1. Import ─────────────────────────────────────────────────
    sce <- importCellRanger(cellRangerDirs = input_dir)
    update_summary("Raw Data", sce)

    # ── 2. Remove fake mapping genes ──────────────────────────────
    fake_genes <- grep("^(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|X|Y)_",
                       rownames(sce), value = TRUE)
    sce <- sce[setdiff(rownames(sce), fake_genes), ]
    update_summary("Removing FMGs", sce)

    # ── 3. Convert → filter → QC → cluster ───────────────────────
    seurat_obj       <- convert_to_seurat(sce);              rm(sce);        gc()
    processed_sce    <- process_sce(seurat_obj, samplelist); rm(seurat_obj); gc()
    processed_seurat <- process_sce_to_seurat(processed_sce)

    # ── 4. Save RDS ───────────────────────────────────────────────
    saveRDS(processed_seurat, rds_out)
    message("Saved: ", rds_out)
    rm(processed_seurat); gc()

    qc_csv <- file.path(dirname(rds_out), "QC_summary.csv")
    write.csv(qc_summary, qc_csv, row.names = FALSE)
    message("QC summary saved: ", qc_csv)

cell_annotation#

[cell_annotation]
parameter: sctk_rds          = str
parameter: output_dir        = str
parameter: seurat_ref        = str
parameter: downsample_n      = 1000
parameter: cluster_threshold = 0.90

import os
input:  []
output: f'{output_dir}/celltyped_seuratobj.rds'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'

    options(future.globals.maxSize = 16 * 1024^3)
    library(future)
    plan(multicore, workers = 16)
    library(Seurat)
    library(singleCellTK)
    library(tidyverse)
    library(SummarizedExperiment)
    library(scuttle)
    library(SingleR)
    library(org.Hs.eg.db)
    library(dplyr)

    sctk_file         <- "${sctk_rds}"
    output_dir        <- "${output_dir}"
    downsample_n      <- ${downsample_n}
    cluster_threshold <- ${cluster_threshold}

    anno_file <- file.path(output_dir, "annotated_seuratobj.rds")
    out_file  <- "${_output}"

    # ── Create output directory ───────────────────────────────────
    dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)

    get_most_common_label <- function(cell_types, threshold = cluster_threshold) {
      label_table <- table(cell_types)
      most_common <- max(label_table)
      if ((most_common / sum(label_table)) >= threshold) {
        return(names(which.max(label_table)))
      } else {
        return(NA)
      }
    }

    # 1. Load and downsample
    seuratObj <- readRDS(sctk_file)
    seuratObj <- subset(x = seuratObj, downsample = downsample_n)

    # 2. Extract counts, strip Ensembl version suffixes
    sce_count <- GetAssayData(seuratObj[["RNA"]], layer = "counts")
    rownames(sce_count) <- gsub("\\..*$", "", rownames(sce_count))
    rm(seuratObj); gc()

    # 3. Ensembl to HGNC symbol conversion
    IDs <- select(org.Hs.eg.db,
                  keys    = rownames(sce_count),
                  columns = c("ENSEMBL", "SYMBOL"),
                  keytype = "ENSEMBL")
    colnames(IDs) <- c("ensembl_gene_id", "hgnc_symbol")
    IDs <- IDs[!is.na(IDs$hgnc_symbol) & IDs$hgnc_symbol != "", ]
    IDs <- IDs[!duplicated(IDs$hgnc_symbol), ]
    IDs <- IDs[!duplicated(IDs$ensembl_gene_id), ]
    sce_count <- sce_count[rownames(sce_count) %in% IDs$ensembl_gene_id, ]
    rownames(IDs) <- IDs$ensembl_gene_id
    rownames(sce_count) <- IDs[rownames(sce_count), "hgnc_symbol"]
    rm(IDs); gc()

    # 4. Load reference, intersect, free reference immediately after
    seurat_ref_SE <- readRDS("${seurat_ref}")
    common_genes  <- intersect(rownames(sce_count), rownames(seurat_ref_SE))
    ref_subset    <- seurat_ref_SE[common_genes, ]
    rm(seurat_ref_SE); gc()

    sce_count <- sce_count[common_genes, ]
    sce_SE    <- SummarizedExperiment(assays = list(counts = sce_count))
    sce_SE    <- logNormCounts(sce_SE)
    rm(sce_count); gc()

    # 5. SingleR annotation
    singleR_res <- SingleR(test   = sce_SE,
                           ref    = ref_subset,
                           labels = ref_subset$ref_label)
    rm(sce_SE, ref_subset); gc()

    # 6. Add labels to downsampled object
    seuratObj <- readRDS(sctk_file)
    seuratObj <- subset(x = seuratObj, downsample = downsample_n)
    anno_df   <- data.frame(celltype    = singleR_res$labels,
                            column_name = rownames(singleR_res),
                            stringsAsFactors = FALSE)
    seuratObj@meta.data <- seuratObj@meta.data %>%
      inner_join(anno_df, by = "column_name")
    rownames(seuratObj@meta.data) <- seuratObj@meta.data$column_name
    saveRDS(seuratObj, anno_file)
    message("Saved annotated (downsampled): ", anno_file)

    # 7. Transfer labels to full object
    full_obj <- readRDS(sctk_file)
    full_obj@meta.data$cell <- rownames(full_obj@meta.data)

    singleR_data <- data.frame(
      celltype        = seuratObj@meta.data$celltype,
      seurat_clusters = seuratObj@meta.data$seurat_clusters
    )
    rm(seuratObj); gc()

    most_common_labels <- singleR_data %>%
      group_by(seurat_clusters) %>%
      summarise(most_common_label = get_most_common_label(celltype), .groups = "drop")

    merged_meta  <- merge(full_obj@meta.data, most_common_labels,
                          by = "seurat_clusters", all.x = TRUE)
    ordered_meta <- merged_meta[match(colnames(full_obj@assays$RNA), merged_meta$cell), ]

    full_obj@meta.data           <- ordered_meta
    rownames(full_obj@meta.data) <- full_obj@meta.data$cell
    full_obj@meta.data$celltype  <- full_obj@meta.data$most_common_label
    Idents(full_obj)             <- full_obj@meta.data$celltype

    saveRDS(full_obj, out_file)
    message("Saved celltyped: ", out_file)
    rm(full_obj, singleR_res, anno_df, singleR_data,
       most_common_labels, merged_meta, ordered_meta); gc()