Topeax -- An Improved Clustering Topic Model with Density Peak Detection and Lexical-Semantic Term Importance

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two weaknesses of existing clustering topic models such as Top2Vec and BERTopic: they are unstable at discovering natural clusters, and their keyword extraction ignores either semantic distance (BERTopic) or term frequency (Top2Vec), which often yields incoherent or insufficiently varied topics. The paper proposes Topeax, which determines the number of topics from peaks in density estimates and selects keywords with a term importance measure that combines lexical statistics with semantic embeddings. In the reported experiments, Topeax is better than both baselines at cluster recovery and cluster description, and is more robust to changes in sample size and hyperparameters.
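
To make the idea of combining lexical and semantic term importance concrete, here is a minimal sketch in plain Python. It is not the formula from the paper: the lexical score (share of a term's occurrences that fall inside the topic), the clipped cosine similarity, the geometric mean used to combine them, and all names and toy numbers (`combined_importance`, `topic_counts`, `zxq`, and so on) are assumptions made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_importance(term, topic_counts, corpus_counts, embeddings, topic_vector):
    """Hypothetical joint score; NOT the measure defined in the paper."""
    # Lexical signal: share of the term's corpus occurrences that fall in this topic.
    lexical = topic_counts.get(term, 0) / corpus_counts[term]
    # Semantic signal: cosine similarity to the topic vector, clipped at zero.
    semantic = max(cosine(embeddings[term], topic_vector), 0.0)
    # Geometric mean (an assumption): a term needs BOTH signals to rank highly.
    return math.sqrt(lexical * semantic)


# Toy data (invented): a football topic with a stop word and a junk token.
corpus_counts = {"the": 100, "goal": 10, "football": 8, "zxq": 1}
topic_counts = {"the": 30, "goal": 9, "football": 8, "zxq": 1}
embeddings = {
    "the": [0.1, 1.0],
    "goal": [0.9, 0.2],
    "football": [1.0, 0.1],
    "zxq": [-0.2, 1.0],
}
topic_vector = [1.0, 0.0]

scores = {
    term: combined_importance(term, topic_counts, corpus_counts, embeddings, topic_vector)
    for term in corpus_counts
}
ranked = sorted(scores, key=scores.get, reverse=True)
by_count = sorted(topic_counts, key=topic_counts.get, reverse=True)

print("combined ranking:  ", ranked)
print("count-only ranking:", by_count)
```

On this toy data, ranking by raw in-topic counts puts the stop word "the" first, while the combined score puts "football" and "goal" first and gives the junk token "zxq" a score of zero, because it is semantically unrelated to the topic vector even though all of its occurrences fall inside the topic.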

📝 Abstract
Text clustering is today the most popular paradigm for topic modelling, both in academia and industry. Despite clustering topic models' apparent success, we identify a number of issues in Top2Vec and BERTopic, which remain largely unsolved. Firstly, these approaches are unreliable at discovering natural clusters in corpora, due to extreme sensitivity to sample size and hyperparameters, the default values of which result in suboptimal behaviour. Secondly, when estimating term importance, BERTopic ignores the semantic distance of keywords to topic vectors, while Top2Vec ignores word counts in the corpus. This results in less coherent topics on the one hand, due to the presence of stop words and junk words, and in a lack of variety and trust on the other. In this paper, I introduce a new approach, Topeax, which discovers the number of clusters from peaks in density estimates, and combines lexical and semantic indices of term importance to obtain high-quality topic keywords. Topeax is demonstrated to be better at both cluster recovery and cluster description than Top2Vec and BERTopic, while also exhibiting less erratic behaviour in response to changing sample size and hyperparameters.
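
The idea of reading the number of clusters off peaks in a density estimate can be illustrated with a small, self-contained sketch. This is a one-dimensional toy using only the Python standard library, not the procedure from the paper: the Gaussian kernel, the fixed bandwidth, the grid search for local maxima, the nearest-peak assignment, and all function names and data points are assumptions chosen for illustration.

```python
import math


def gaussian_kde_1d(points, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid position."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
        for g in grid
    ]


def find_density_peaks(points, bandwidth=0.5, grid_size=200):
    """Return grid positions where the estimated density has a local maximum."""
    low = min(points) - 3.0 * bandwidth
    high = max(points) + 3.0 * bandwidth
    step = (high - low) / (grid_size - 1)
    grid = [low + i * step for i in range(grid_size)]
    density = gaussian_kde_1d(points, grid, bandwidth)
    # A grid point is a peak if it rises above its left neighbour and is
    # not exceeded by its right neighbour.
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if density[i] > density[i - 1] and density[i] >= density[i + 1]
    ]


def assign_to_peaks(points, peaks):
    """Label each point with the index of its nearest density peak."""
    return [min(range(len(peaks)), key=lambda k: abs(p - peaks[k])) for p in points]


# Toy 1-D "embedding" (invented) with three well-separated groups.
points = [0.0, 0.2, -0.1, 0.1, 5.0, 5.1, 4.9, 5.2, 10.0, 9.8, 10.1]
peaks = find_density_peaks(points)
labels = assign_to_peaks(points, peaks)

print("number of clusters:", len(peaks))
print("labels:", labels)
```

Here the number of clusters is an output of the density estimate, not a user-supplied parameter. The sketch still has a bandwidth to choose, so it only illustrates the mechanism and says nothing about how the paper achieves robustness to hyperparameters.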
Problem

Research questions and friction points this paper is trying to address.

topic modeling
text clustering
term importance
density peak detection
lexical-semantic analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

density peak detection
lexical-semantic term importance
topic modeling
text clustering
cluster stability