scShapeBench: Discovering geometry from high dimensional scRNAseq data

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
Single-cell RNA sequencing data exhibit diverse geometric structures, such as clusters, trajectories, and branches, yet existing analysis pipelines typically assume a predefined shape and do not infer the intrinsic geometry of the data automatically. To address this limitation, this work introduces scShapeBench, presented as the first comprehensive benchmark designed specifically for geometric structure identification in single-cell data, comprising both synthetic and expert-annotated real datasets. The authors also propose scReebTower, a baseline method grounded in diffusion geometry that detects data shape automatically by extracting Reeb graphs. By integrating diffusion geometry with topological skeleton sampling, scReebTower connects visualization with the selection of downstream analysis pipelines. On both synthetic and real datasets, scReebTower is reported to outperform PAGA and Mapper under topology-aware evaluation metrics.
📝 Abstract
High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.
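The abstract states that synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. The following is a minimal sketch of what such sampling could look like, not the authors' generator: the function name `sample_from_skeleton`, the Y-shaped example graph, the length-weighted edge choice, and the isotropic Gaussian noise model are all assumptions made for illustration.

```python
import random


def sample_from_skeleton(nodes, edges, n_points, sigma, seed=0):
    """Sample noisy points along the edges of a skeleton graph.

    nodes: dict mapping node name -> coordinate tuple
    edges: list of (node_a, node_b) pairs
    sigma: standard deviation of the isotropic Gaussian noise (assumed model)
    Returns the sampled points and the index of the edge each came from.
    """
    rng = random.Random(seed)
    # Weight edges by Euclidean length so points spread evenly along the skeleton.
    lengths = [
        sum((a - b) ** 2 for a, b in zip(nodes[u], nodes[v])) ** 0.5
        for u, v in edges
    ]
    points, labels = [], []
    for _ in range(n_points):
        idx = rng.choices(range(len(edges)), weights=lengths, k=1)[0]
        u, v = edges[idx]
        t = rng.random()  # relative position along the chosen edge
        point = tuple(
            (1 - t) * a + t * b + rng.gauss(0.0, sigma)
            for a, b in zip(nodes[u], nodes[v])
        )
        points.append(point)
        labels.append(idx)
    return points, labels


# Hypothetical Y-shaped skeleton standing in for a "multi-branching" dataset.
nodes = {
    "root": (0.0, 0.0),
    "fork": (0.0, 1.0),
    "left": (-1.0, 2.0),
    "right": (1.0, 2.0),
}
edges = [("root", "fork"), ("fork", "left"), ("fork", "right")]
points, labels = sample_from_skeleton(nodes, edges, n_points=500, sigma=0.05)
```

Raising `sigma` blurs the branches into one another, which is one way a benchmark of this kind could vary difficulty while the ground-truth skeleton stays known.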
Problem

Research questions and friction points this paper is trying to address.

single-cell RNA-seq
shape detection
high-dimensional data
topology
automated analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

shape detection
single-cell RNA-seq
Reeb graph
diffusion geometry
benchmark dataset
Authors

Andrew J Steindl, PhD Student, Yale University (Geometry, Topology, Manifold Learning)
João Felipe Rocha, PhD Student, Yale University (data science, bioinformatics, graph signal processing, deep learning)
Brian Tshilengi Di Bassinga, Yale University
Zachary Warren, Vanderbilt University (Autism)
Matthew Scicluna, Mila / Université de Montréal
César Miguel Valdez Córdova, Mila / Université de Montréal
Shabarni Gupta, Garvan Institute of Medical Research
Leire Torices, Garvan Institute of Medical Research
Daniel Neumann, School of Biomedical Sciences, University of New South Wales
Timothy J. Mann, School of Biomedical Sciences, University of New South Wales
Ihuan Gunawan, School of Biomedical Sciences, University of New South Wales
Dhananjay Bhaskar, Assistant Professor, UW-Madison (Topological Data Analysis, Computational Biology, Agent-Based Modeling, Machine Learning)
John G Lock, School of Biomedical Sciences, University of New South Wales
Christine L Chaffer, Garvan Institute of Medical Research
Guy Wolf, Université de Montréal; Mila (Exploratory Data Analysis, Dimensionality Reduction, Manifold Learning, Geometric Deep Learning, Graph Signal Processing)
Smita Krishnaswamy, Yale University (Machine Learning, Data Mining, Manifold Learning, Deep Learning, Computational Biology)