🤖 AI Summary
Traditional unsupervised clustering methods (e.g., k-means) struggle with high-dimensional, semantically complex text data. To address this, we propose an end-to-end unsupervised clustering framework that jointly leverages Transformer-based contextual embeddings and an enhanced autoencoder. First, semantically rich representations are extracted with a pretrained Transformer. Second, a dual-objective reconstruction loss, combining mean squared error and cosine similarity, improves semantic fidelity. Third, a context-aware semantic refinement module optimizes soft cluster assignments. Finally, a joint distribution alignment loss enhances the structural consistency of the clustering output. Evaluated on five benchmark datasets, including AG News and Yahoo! Answers, our method consistently outperforms state-of-the-art approaches. Notably, it achieves a new SOTA clustering accuracy of 53.63% on Yahoo! Answers, demonstrating superior capability in both semantic preservation and meaningful cluster structure discovery.
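The dual-objective reconstruction loss described above can be sketched as a weighted sum of MSE and cosine distance between an input embedding and its reconstruction. This is a minimal illustration, not the paper's implementation: the weighting hyperparameter `alpha` and the exact way the two terms are combined are assumptions, since the summary does not specify them.

```python
import math

def combined_reconstruction_loss(x, x_hat, alpha=0.5):
    """Dual-objective reconstruction loss: alpha * MSE + (1 - alpha) * cosine distance.

    x, x_hat: input embedding and its autoencoder reconstruction (lists of floats).
    alpha: hypothetical weighting hyperparameter (an assumption; not given in the source).
    """
    n = len(x)
    # Mean squared error penalizes elementwise reconstruction error.
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / n
    # Cosine distance (1 - cosine similarity) penalizes directional drift,
    # which is what "semantic fidelity" of embeddings tends to care about.
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
    cosine_dist = 1.0 - dot / norm
    return alpha * mse + (1 - alpha) * cosine_dist
```

A perfect reconstruction drives both terms to zero, while a reconstruction with the right magnitude but the wrong direction is still penalized by the cosine term.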
📝 Abstract
The high-dimensional and semantically complex nature of textual big data poses significant challenges for text clustering, frequently leading to suboptimal groupings under conventional techniques such as k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with Transformer-based embeddings to overcome these challenges. The method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within the autoencoder. SDEC then applies a semantic refinement stage that exploits the contextual richness of Transformer embeddings to further improve a clustering layer with soft cluster assignments and a distributional loss. Extensive testing on five benchmark datasets (AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5) demonstrates SDEC's capabilities: the framework outperformed existing methods with a clustering accuracy of 85.7% on AG News, set a new benchmark of 53.63% on Yahoo! Answers, and showed robust performance across the other text corpora. These findings highlight the gains in accuracy and semantic comprehension that SDEC's advances bring to unsupervised text clustering.
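Since SDEC builds on deep embedded clustering, the clustering layer with soft assignments and a distributional loss can be sketched with the standard DEC formulation: a Student's t-kernel soft assignment and a sharpened target distribution that the assignments are pulled toward (typically via a KL divergence). This is a sketch under the assumption that SDEC follows this DEC-style scheme; the exact kernel and refinement used by SDEC may differ.

```python
def soft_assign(z, centroids, alpha=1.0):
    """DEC-style soft assignment: q_j ∝ (1 + ||z - mu_j||^2 / alpha)^(-(alpha+1)/2).

    z: embedded point (list of floats); centroids: list of cluster centers.
    Returns a probability vector over clusters.
    """
    q = []
    for mu in centroids:
        d2 = sum((zi - mi) ** 2 for zi, mi in zip(z, mu))
        q.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
    s = sum(q)
    return [v / s for v in q]

def target_distribution(Q):
    """Sharpened target p_ij = (q_ij^2 / f_j) / sum_j'(q_ij'^2 / f_j'),
    where f_j = sum_i q_ij is the soft cluster frequency.
    Emphasizes high-confidence assignments while normalizing per cluster size.
    """
    f = [sum(row[j] for row in Q) for j in range(len(Q[0]))]
    P = []
    for row in Q:
        p = [q * q / fj for q, fj in zip(row, f)]
        s = sum(p)
        P.append([v / s for v in p])
    return P
```

Training then minimizes the divergence between each row of `Q` and the corresponding row of `P`, which is what drives the "joint distribution alignment" toward a structurally consistent clustering.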