scE$^2$TM: Toward Interpretable Single-Cell Embedding via Topic Modeling

📅 2025-07-11
🤖 AI Summary
Existing single-cell embedded topic models face two interpretability bottlenecks. Their interpretability has been assessed mainly through qualitative analysis, which leaves them exposed to "interpretation collapse", and they neglect external biological knowledge, which constrains analytical performance. To address both, the authors propose scE2TM, an external knowledge-guided single-cell embedded topic model, together with a new interpretability evaluation benchmark of ten quantitative metrics. Evaluated on 20 scRNA-seq datasets, scE2TM achieves significant clustering performance gains over seven state-of-the-art methods, and its interpretations score well on diversity and on consistency with the underlying biological signals.

📝 Abstract
Recent advances in sequencing technologies have enabled researchers to explore cellular heterogeneity at single-cell resolution. Meanwhile, interpretability has gained prominence in parallel with the rapid increase in the complexity and performance of deep learning models. In recent years, topic models have been widely used for interpretable single-cell embedding learning and clustering analysis; we refer to these as single-cell embedded topic models. However, previous studies evaluated the interpretability of these models mainly through qualitative analysis, and the models themselves are prone to interpretation collapse. Furthermore, their neglect of external biological knowledge constrains analytical performance. Here, we present scE2TM, an external knowledge-guided single-cell embedded topic model that provides high-quality cell embeddings and strong interpretation, contributing to comprehensive scRNA-seq data analysis. Our comprehensive evaluation across 20 scRNA-seq datasets demonstrates that scE2TM achieves significant clustering performance gains compared to 7 state-of-the-art methods. In addition, we propose a new interpretability evaluation benchmark that introduces 10 metrics to quantitatively assess the interpretability of single-cell embedded topic models. The results show that the interpretation provided by scE2TM performs encouragingly in terms of diversity and consistency with the underlying biological signals, which helps reveal the underlying biological mechanisms.
Problem

Research questions and friction points this paper is trying to address.

Addresses interpretation collapse in single-cell embedded topic models
Integrates external biological knowledge to enhance analytical performance
Proposes quantitative metrics for evaluating model interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

External knowledge-guided topic modeling
Quantitative interpretability evaluation benchmark
High-quality cell embedding via scE2TM
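
To make the idea of a quantitative interpretability metric concrete, here is a minimal sketch of topic diversity as it is commonly defined in the topic modeling literature: the fraction of unique genes among the top-k genes of all topics. It is an illustration only and is not claimed to be one of the ten metrics in the scE2TM benchmark; the function name, the `top_k` parameter, and the toy weights are all assumptions. A low score means topics share most of their top genes, the kind of redundancy that "interpretation collapse" suggests.

```python
from typing import Sequence


def topic_diversity(topic_gene_weights: Sequence[Sequence[float]], top_k: int = 3) -> float:
    """Fraction of unique genes among the top-k genes of every topic.

    topic_gene_weights[t][g] is the (hypothetical) weight of gene g in topic t.
    Returns a value in (0, 1]; 1.0 means no top gene is shared between topics.
    """
    if not topic_gene_weights or top_k <= 0:
        raise ValueError("need at least one topic and a positive top_k")

    top_genes = []
    for weights in topic_gene_weights:
        # Rank gene indices by descending weight and keep the first top_k.
        ranked = sorted(range(len(weights)), key=lambda g: weights[g], reverse=True)
        top_genes.extend(ranked[:top_k])

    return len(set(top_genes)) / len(top_genes)


# Toy example: two topics over four genes (illustrative numbers only).
distinct = [[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, 0.8, 0.9]]   # topics use different genes
collapsed = [[0.9, 0.8, 0.1, 0.0], [0.8, 0.9, 0.0, 0.1]]  # topics share their top genes

print(topic_diversity(distinct, top_k=2))
print(topic_diversity(collapsed, top_k=2))
```

In the toy data, the two `distinct` topics have no top genes in common, while the `collapsed` topics rank the same two genes highest, so the second call returns a lower score.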
Hegang Chen
School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou, 510006, China.
Yuyin Lu
School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou, 510006, China.
Zhiming Dai
School of Computer Science and Engineering, Sun Yat-sen University, 132 Waihuan East Road, Guangzhou, 510006, China.
Fu Lee Wang
Hong Kong Metropolitan University
AI, Data Science, Learning Technology
Qing Li
Department of Computing, The Hong Kong Polytechnic University, Street, Hong Kong, 610101, China.
Yanghui Rao
Sun Yat-sen University
Text Mining, Topic Modeling, Representation Learning