🤖 AI Summary
Existing single-cell embedded topic models face two interpretability bottlenecks: their interpretability is assessed mainly through qualitative analysis, which leaves them exposed to "interpretation collapse", and they do not incorporate external biological knowledge, which constrains analytical performance. To address these, we propose scE2TM, an external knowledge-guided single-cell embedded topic model, together with a new quantitative interpretability benchmark comprising 10 metrics. scE2TM yields high-quality cell embeddings for clustering alongside strong interpretability. Evaluated on 20 scRNA-seq datasets, scE2TM achieves significant clustering performance gains over 7 state-of-the-art methods, and its interpretations score well on diversity and on consistency with the underlying biological signals.
📝 Abstract
Recent advances in sequencing technologies have enabled researchers to explore cellular heterogeneity at single-cell resolution. Meanwhile, interpretability has gained prominence in parallel with the rapid increase in the complexity and performance of deep learning models. In recent years, topic models have been widely used for interpretable single-cell embedding learning and clustering analysis; we refer to these as single-cell embedded topic models. However, previous studies evaluated the interpretability of these models mainly through qualitative analysis, and the models may suffer from interpretation collapse. Furthermore, their neglect of external biological knowledge constrains analytical performance. Here, we present scE2TM, an external knowledge-guided single-cell embedded topic model that provides high-quality cell embeddings and strong interpretability, supporting comprehensive scRNA-seq data analysis. Our evaluation across 20 scRNA-seq datasets demonstrates that scE2TM achieves significant clustering performance gains compared to 7 state-of-the-art methods. In addition, we propose a new interpretability evaluation benchmark that introduces 10 metrics to quantitatively assess the interpretability of single-cell embedded topic models. The results show that the interpretations provided by scE2TM perform well in terms of diversity and consistency with the underlying biological signals, helping to better reveal the underlying biological mechanisms.