🤖 AI Summary
Short text clustering faces significant challenges due to data sparsity, high dimensionality, and large scale, with existing methods suffering from bottlenecks in semantic coherence and computational efficiency. To address these issues, we propose GSDMM+, an enhanced clustering model built upon the Dirichlet Multinomial Mixture (DMM) framework. Our key contributions are threefold: (1) an entropy-driven adaptive word weighting mechanism to mitigate initialization noise; (2) a dynamic cluster merging strategy that jointly optimizes granularity and intra-topic cohesion; and (3) an improved collapsed Gibbs sampling algorithm for accelerated convergence. Extensive experiments on multiple benchmark datasets demonstrate that GSDMM+ consistently outperforms classical approaches (e.g., LDA, BTM) and state-of-the-art methods (e.g., GSDMM, SCL), achieving average improvements of 8.2% in normalized mutual information (NMI) and adjusted Rand index (ARI), while reducing runtime by 37%. The source code is publicly available.
📝 Abstract
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. This task is inherently challenging due to the sparse, large-scale, and high-dimensional characteristics of the short text data. Furthermore, the computational intensity required by representation learning significantly increases the running time. To address these issues, we propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts while identifying representative words for each cluster. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance. GSDMM+ reduces initialization noise and adaptively adjusts word weights based on entropy, achieving fine-grained clustering that reveals more topic-related information. Additionally, strategic cluster merging is employed to refine clustering granularity, better aligning the predicted distribution with the true category distribution. We conduct extensive experiments, comparing our methods with both classical and state-of-the-art approaches. The experimental results demonstrate the efficiency and effectiveness of our methods. The source code for our model is publicly available at https://github.com/chehaoa/VEMC.