An Enhanced Model-based Approach for Short Text Clustering

📅 2025-07-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Short text clustering faces significant challenges due to data sparsity, high dimensionality, and large scale, with existing methods suffering from bottlenecks in semantic coherence and computational efficiency. To address these issues, we propose GSDMM+, an enhanced clustering model built upon the Dirichlet Multinomial Mixture (DMM) framework. Our key contributions are threefold: (1) an entropy-driven adaptive word weighting mechanism to mitigate initialization noise; (2) a dynamic cluster merging strategy that jointly optimizes granularity and intra-topic cohesion; and (3) an improved collapsed Gibbs sampling algorithm for accelerated convergence. Extensive experiments on multiple benchmark datasets demonstrate that GSDMM+ consistently outperforms classical approaches (e.g., LDA, BTM) and state-of-the-art methods (e.g., GSDMM, SCL), achieving average improvements of 8.2% in normalized mutual information (NMI) and adjusted Rand index (ARI), while reducing runtime by 37%. The source code is publicly available.

Technology Category

Application Category

📝 Abstract
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.
Problem

Research questions and friction points this paper is trying to address.

Addresses sparsity and high dimensionality in short text clustering
Optimizes performance of GSDMM with noise reduction and adaptive weights
Refines clustering granularity via strategic merging for accurate topic distribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses collapsed Gibbs Sampling for Dirichlet Multinomial Mixture
Reduces initialization noise and adjusts word weights adaptively
Employs strategic cluster merging to refine granularity
🔎 Similar Papers
No similar papers found.
Enhao Cheng
Enhao Cheng
Shandong University
graphRecommendation System
Shoujia Zhang
Shoujia Zhang
Shandong University
Machine LearningData MiningGraph LearningRecommendation System
J
Jianhua Yin
School of Computer Science and Technology, Shandong University (Qingdao), Qingdao, Shandong 266237, China
Xuemeng Song
Xuemeng Song
City University of Hong Kong
Information RetrievalMultimedia Analysis
T
Tian Gan
School of Computer Science and Technology, Shandong University (Qingdao), Qingdao, Shandong 266237, China
L
Liqiang Nie
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangzhou 518055, China