snRNA-seq Preprocessing QC Pipeline#
Overview#
This pipeline performs quality control on single-nuclei RNA-seq data using SCTK-QC and Seurat.
Goals:
Remove low-quality cells using mitochondrial content, count depth (nUMI), and gene detection thresholds (nFeature)
Filter ambient RNA contamination (decontX)
Remove doublets using a user-selected method (scds, scrublet, doubletFinder, doubletCells)
Generate a QC-filtered, clustered Seurat object
Input Files#
File |
Description |
|---|---|
CellRanger output directory |
Path to a single CellRanger output directory |
|
Sample barcode-to-patient ID mapping (columns: |
Parameters#
Parameter |
Default |
Description |
|---|---|---|
|
required |
Path to CellRanger output directory |
|
required |
Output directory |
|
required |
Path to demultiplexed sample ID mapping CSV |
|
|
Maximum mitochondrial read percentage |
|
|
Minimum total counts per cell (nUMI) |
|
|
Minimum detected genes per cell (nFeature) |
|
|
Doublet detection method: |
|
|
Required SCDS call label (used when |
|
|
Scrublet score cutoff (used when |
|
|
Ambient RNA removal method: |
|
|
Maximum decontX contamination score (used when |
Output Files#
File |
Description |
|---|---|
|
QC-filtered, clustered Seurat object |
|
Per-sample SCTK QC metrics |
|
Cell/gene counts at each QC step |
Timing: < 10 min per batch
Minimal Working Example#
sos run pipeline/snRNAseq_preprocessing.ipynb sctk_qc \
--input-dir data/snrna_seq/grouptest \
--output-dir output/snrna_seq \
--sample-meta data/snrna_seq/ROSMAP_snRNAseq_demultiplexed_ID_mapping.csv
Cell Type Annotation#
Overview#
This pipeline annotates cell types in QC-filtered single-nuclei RNA-seq data using SingleR and transfers cluster-level labels to full Seurat objects.
Goals:
Annotate cell types on downsampled data using a curated reference
Transfer cluster-majority labels to full Seurat objects
Generate celltype-labeled Seurat objects ready for pseudobulk aggregation
Input Files#
File |
Description |
|---|---|
|
QC-filtered Seurat object from |
|
SingleR reference SummarizedExperiment object |
Parameters#
Parameter |
Default |
Description |
|---|---|---|
|
required |
Path to |
|
required |
Output directory |
|
required |
Path to |
|
|
Cells per cluster to downsample before SingleR |
|
|
Minimum fraction of cells required to assign a cluster-level label |
Output Files#
File |
Description |
|---|---|
|
Downsampled Seurat object with per-cell SingleR labels |
|
Full Seurat object with cluster-majority cell type labels |
Timing: >30 min per seurat object
Minimal Working Example#
sos run pipeline/snRNAseq_preprocessing.ipynb cell_annotation \
--sctk-rds output/snrna_seq/SCTK_results/filtered_seuratobj.rds \
--output-dir output/cell_annotation \
--seurat-ref data/snrna_seq/seurat_ref_SE.rds
Command interface#
sos run pipeline/snRNAseq_preprocessing.ipynb -h
[workflow_name | -t targets] [options] [workflow_options]
workflow_name: Single or combined workflows defined in this script
targets: One or more targets to generate
options: Single-hyphen sos parameters (see "sos run -h" for details)
workflow_options: Double-hyphen workflow-specific parameters
Workflows:
sctk_qc
cell_annotation
Global Workflow Options:
--cwd output (as path)
--job-size 1 (as int)
--walltime 5h
--mem 16G
--numThreads 8 (as int)
--container ''
Sections
sctk_qc:
Workflow Options:
--input-dir VAL (as str, required)
--output-dir VAL (as str, required)
--sample-meta VAL (as str, required)
--max-mito-percent 5 (as int)
--min-ncounts 250 (as int)
--min-ngenes 500 (as int)
--doublet-method scds
--scds-call Singlet
--scrublet-threshold 0.25 (as float)
--ambient-method decontX
--max-decontXsc 0.5 (as float)
cell_annotation:
Workflow Options:
--sctk-rds VAL (as str, required)
--output-dir VAL (as str, required)
--seurat-ref VAL (as str, required)
--downsample-n 1000 (as int)
--cluster-threshold 0.9 (as float)
Setup and global parameters#
[global]
parameter: cwd = path("output")
parameter: job_size = 1
parameter: walltime = "5h"
parameter: mem = "16G"
parameter: numThreads = 8
parameter: container = ""
import re
from sos.utils import expand_size
) if container else ""
cwd = path(f'{cwd:a}')
sctk_qc#
[sctk_qc]
parameter: input_dir = str
parameter: output_dir = str
parameter: sample_meta = str
parameter: max_mito_percent = 5
parameter: min_ncounts = 250
parameter: min_ngenes = 500
parameter: doublet_method = 'scds' # options: 'scds', 'scrublet', 'doubletFinder', 'doubletCells', 'none'
parameter: scds_call = 'Singlet'
parameter: scrublet_threshold = 0.25
parameter: ambient_method = 'decontX' # options: 'decontX', 'none'
parameter: max_decontXsc = 0.5
import os
input: []
output: f'{output_dir}/SCTK_results/filtered_seuratobj.rds',
f'{output_dir}/QC_table/SCTK_QC_table.csv'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'
options(future.globals.maxSize = 8 * 1024^3)
library(future)
plan(multicore, workers = 16)
library(Seurat)
library(singleCellTK)
library(tidyverse)
library(SummarizedExperiment)
library(scuttle)
library(dplyr)
library(SCOPfunctions)
input_dir <- "${input_dir}"
qc_table_out <- "${_output[1]}"
rds_out <- "${_output[0]}"
doublet_method <- "${doublet_method}"
ambient_method <- "${ambient_method}"
# Validate method choices
valid_doublet <- c("scds", "scrublet", "doubletFinder", "doubletCells", "none")
valid_ambient <- c("decontX", "none")
if (!doublet_method %in% valid_doublet)
stop("Invalid doublet_method '", doublet_method, "'. Choose from: ", paste(valid_doublet, collapse = ", "))
if (!ambient_method %in% valid_ambient)
stop("Invalid ambient_method '", ambient_method, "'. Choose from: ", paste(valid_ambient, collapse = ", "))
message("Doublet detection method : ", doublet_method)
message("Ambient RNA removal method: ", ambient_method)
# ── Create output directories ─────────────────────────────────
dir.create(dirname(rds_out), showWarnings = FALSE, recursive = TRUE)
dir.create(dirname(qc_table_out), showWarnings = FALSE, recursive = TRUE)
# ── QC summary table ──────────────────────────────────────────
qc_summary <- data.frame(Step = character(),
Cells = numeric(), Genes = numeric(),
stringsAsFactors = FALSE)
update_summary <- function(step, obj) {
qc_summary <<- rbind(qc_summary, data.frame(
Step = step,
Cells = ncol(obj), Genes = nrow(obj)))
}
# ── Sample metadata ───────────────────────────────────────────
samplelist <- read.csv("${sample_meta}")
samplelist$column_name <- paste0(samplelist$libraryBatch, "-counts_", samplelist$cellBarcode)
samplelist <- samplelist[, c("individualID", "column_name")]
colnames(samplelist) <- c("sample", "column_name")
# ── Helper: convert SCE to Seurat ─────────────────────────────
convert_to_seurat <- function(sce) {
obj <- CreateSeuratObject(counts = counts(sce),
meta.data = as.data.frame(colData(sce)),
min.cells = ncol(sce) * 0.005)
update_summary("Removing Genes expressed in < 0.5% of cells", obj)
return(obj)
}
# ── Helper: assign sample IDs, convert back to SCE ────────────
process_sce <- function(seurat_obj, samplelist) {
seurat_obj <- seurat_obj[, colnames(seurat_obj) %in% samplelist$column_name]
update_summary("Removing Cells with no associated Patient ID", seurat_obj)
seurat_obj@meta.data$sample <- NULL
seurat_obj@meta.data <- seurat_obj@meta.data %>%
inner_join(samplelist, by = "column_name")
seurat_obj@meta.data <- seurat_obj@meta.data[
!duplicated(seurat_obj@meta.data$column_name), ]
rownames(seurat_obj@meta.data) <- seurat_obj@meta.data$column_name
return(as.SingleCellExperiment(seurat_obj))
}
# ── Helper: build QC algorithms list from chosen methods ──────
# Doublet detection methods (runCellQC YAML names):
# "scds" → cxds_bcds_hybrid (cxds + bcds hybrid score)
# "scrublet" → scrublet
# "doubletFinder" → doubletFinder
# "doubletCells" → doubletCells (scran-based)
# "none" → skip doublet detection
# Ambient RNA methods:
# "decontX" → decontX
# "none" → skip ambient RNA removal
build_algorithms <- function(doublet_method, ambient_method) {
algos <- "QCMetrics"
if (doublet_method == "scds") algos <- c(algos, "cxds_bcds_hybrid")
if (doublet_method == "scrublet") algos <- c(algos, "scrublet")
if (doublet_method == "doubletFinder") algos <- c(algos, "doubletFinder")
if (doublet_method == "doubletCells") algos <- c(algos, "doubletCells")
if (ambient_method == "decontX") algos <- c(algos, "decontX")
return(algos)
}
# ── Helper: build filter strings from chosen methods ──────────
build_filters <- function(doublet_method, ambient_method) {
filters <- c(
"mito_percent < ${max_mito_percent}",
"total > ${min_ncounts}",
"detected > ${min_ngenes}"
)
if (doublet_method == "scds") {
filters <- c(filters, "scds_hybrid_call=='${scds_call}'")
} else if (doublet_method == "scrublet") {
filters <- c(filters, "scrublet_score < ${scrublet_threshold}")
} else if (doublet_method == "doubletFinder") {
filters <- c(filters, "doubletFinder_doublet_label == 'Singlet'")
} else if (doublet_method == "doubletCells") {
filters <- c(filters, "scDblFinder_class == 'singlet'")
}
if (ambient_method == "decontX") {
filters <- c(filters, "decontX_contamination < ${max_decontXsc}")
}
return(filters)
}
# ── Helper: run SCTK-QC, filter, cluster ─────────────────────
process_sce_to_seurat <- function(sce) {
algos <- build_algorithms(doublet_method, ambient_method)
message("Running QC algorithms: ", paste(algos, collapse = ", "))
sce <- runCellQC(sce, sample = NULL,
algorithms = algos,
mitoRef = "human",
mitoIDType = "ensembl",
mitoGeneLocation = "rownames",
seed = 12345)
# ── Write per-sample QC table ─────────────────────────────
sce_temp <- sampleSummaryStats(sce, sample = colData(sce)$sample, simple = FALSE)
sst <- getSampleSummaryStatsTable(sce_temp, statsName = "qc_table")
write.csv(sst, qc_table_out)
rm(sce_temp)
# ── Apply filters ─────────────────────────────────────────
qc_filters <- build_filters(doublet_method, ambient_method)
message("Applying filters:")
for (f in qc_filters) message(" ", f)
sce <- subsetSCECols(sce, colData = qc_filters)
update_summary("SCTK-QC Results", sce)
# ── Convert to Seurat and cluster ─────────────────────────
seurat_obj <- CreateSeuratObject(counts = counts(sce),
meta.data = as.data.frame(colData(sce)),
min.cells = ncol(sce) * 0.005)
update_summary("Final Seurat Object", seurat_obj)
seurat_obj <- NormalizeData(seurat_obj, normalization.method = "LogNormalize", scale.factor = 10000)
seurat_obj <- FindVariableFeatures(seurat_obj)
seurat_obj <- ScaleData(seurat_obj)
seurat_obj <- RunPCA(seurat_obj)
seurat_obj <- RunUMAP(seurat_obj, dims = 1:10, seed.use = 1)
seurat_obj <- FindNeighbors(seurat_obj)
seurat_obj <- FindClusters(seurat_obj)
return(seurat_obj)
}
# ── 1. Import ─────────────────────────────────────────────────
sce <- importCellRanger(cellRangerDirs = input_dir)
update_summary("Raw Data", sce)
# ── 2. Remove fake mapping genes ──────────────────────────────
fake_genes <- grep("^(1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|X|Y)_",
rownames(sce), value = TRUE)
sce <- sce[setdiff(rownames(sce), fake_genes), ]
update_summary("Removing FMGs", sce)
# ── 3. Convert → filter → QC → cluster ───────────────────────
seurat_obj <- convert_to_seurat(sce); rm(sce); gc()
processed_sce <- process_sce(seurat_obj, samplelist); rm(seurat_obj); gc()
processed_seurat <- process_sce_to_seurat(processed_sce)
# ── 4. Save RDS ───────────────────────────────────────────────
saveRDS(processed_seurat, rds_out)
message("Saved: ", rds_out)
rm(processed_seurat); gc()
qc_csv <- file.path(dirname(rds_out), "QC_summary.csv")
write.csv(qc_summary, qc_csv, row.names = FALSE)
message("QC summary saved: ", qc_csv)
cell_annotation#
[cell_annotation]
parameter: sctk_rds = str
parameter: output_dir = str
parameter: seurat_ref = str
parameter: downsample_n = 1000
parameter: cluster_threshold = 0.90
import os
input: []
output: f'{output_dir}/celltyped_seuratobj.rds'
task: trunk_workers = 1, trunk_size = 1, walltime = '36:00:00', mem = '256G', cores = 16
R: expand = "${ }", stdout = f'{_output:n}.stdout', stderr = f'{_output:n}.stderr'
options(future.globals.maxSize = 16 * 1024^3)
library(future)
plan(multicore, workers = 16)
library(Seurat)
library(singleCellTK)
library(tidyverse)
library(SummarizedExperiment)
library(scuttle)
library(SingleR)
library(org.Hs.eg.db)
library(dplyr)
sctk_file <- "${sctk_rds}"
output_dir <- "${output_dir}"
downsample_n <- ${downsample_n}
cluster_threshold <- ${cluster_threshold}
anno_file <- file.path(output_dir, "annotated_seuratobj.rds")
out_file <- "${_output}"
# ── Create output directory ───────────────────────────────────
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
get_most_common_label <- function(cell_types, threshold = cluster_threshold) {
label_table <- table(cell_types)
most_common <- max(label_table)
if ((most_common / sum(label_table)) >= threshold) {
return(names(which.max(label_table)))
} else {
return(NA)
}
}
# 1. Load and downsample
seuratObj <- readRDS(sctk_file)
seuratObj <- subset(x = seuratObj, downsample = downsample_n)
# 2. Extract counts, strip Ensembl version suffixes
sce_count <- GetAssayData(seuratObj[["RNA"]], layer = "counts")
rownames(sce_count) <- gsub("\\..*$", "", rownames(sce_count))
rm(seuratObj); gc()
# 3. Ensembl to HGNC symbol conversion
IDs <- select(org.Hs.eg.db,
keys = rownames(sce_count),
columns = c("ENSEMBL", "SYMBOL"),
keytype = "ENSEMBL")
colnames(IDs) <- c("ensembl_gene_id", "hgnc_symbol")
IDs <- IDs[!is.na(IDs$hgnc_symbol) & IDs$hgnc_symbol != "", ]
IDs <- IDs[!duplicated(IDs$hgnc_symbol), ]
IDs <- IDs[!duplicated(IDs$ensembl_gene_id), ]
sce_count <- sce_count[rownames(sce_count) %in% IDs$ensembl_gene_id, ]
rownames(IDs) <- IDs$ensembl_gene_id
rownames(sce_count) <- IDs[rownames(sce_count), "hgnc_symbol"]
rm(IDs); gc()
# 4. Load reference, intersect, free reference immediately after
seurat_ref_SE <- readRDS("${seurat_ref}")
common_genes <- intersect(rownames(sce_count), rownames(seurat_ref_SE))
ref_subset <- seurat_ref_SE[common_genes, ]
rm(seurat_ref_SE); gc()
sce_count <- sce_count[common_genes, ]
sce_SE <- SummarizedExperiment(assays = list(counts = sce_count))
sce_SE <- logNormCounts(sce_SE)
rm(sce_count); gc()
# 5. SingleR annotation
singleR_res <- SingleR(test = sce_SE,
ref = ref_subset,
labels = ref_subset$ref_label)
rm(sce_SE, ref_subset); gc()
# 6. Add labels to downsampled object
seuratObj <- readRDS(sctk_file)
seuratObj <- subset(x = seuratObj, downsample = downsample_n)
anno_df <- data.frame(celltype = singleR_res$labels,
column_name = rownames(singleR_res),
stringsAsFactors = FALSE)
seuratObj@meta.data <- seuratObj@meta.data %>%
inner_join(anno_df, by = "column_name")
rownames(seuratObj@meta.data) <- seuratObj@meta.data$column_name
saveRDS(seuratObj, anno_file)
message("Saved annotated (downsampled): ", anno_file)
# 7. Transfer labels to full object
full_obj <- readRDS(sctk_file)
full_obj@meta.data$cell <- rownames(full_obj@meta.data)
singleR_data <- data.frame(
celltype = seuratObj@meta.data$celltype,
seurat_clusters = seuratObj@meta.data$seurat_clusters
)
rm(seuratObj); gc()
most_common_labels <- singleR_data %>%
group_by(seurat_clusters) %>%
summarise(most_common_label = get_most_common_label(celltype), .groups = "drop")
merged_meta <- merge(full_obj@meta.data, most_common_labels,
by = "seurat_clusters", all.x = TRUE)
ordered_meta <- merged_meta[match(colnames(full_obj@assays$RNA), merged_meta$cell), ]
full_obj@meta.data <- ordered_meta
rownames(full_obj@meta.data) <- full_obj@meta.data$cell
full_obj@meta.data$celltype <- full_obj@meta.data$most_common_label
Idents(full_obj) <- full_obj@meta.data$celltype
saveRDS(full_obj, out_file)
message("Saved celltyped: ", out_file)
rm(full_obj, singleR_res, anno_df, singleR_data,
most_common_labels, merged_meta, ordered_meta); gc()