SDEC: Semantic Deep Embedded Clustering

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional unsupervised clustering methods (e.g., k-means) struggle with high-dimensional, semantically complex text data. To address this, we propose an end-to-end unsupervised clustering framework that jointly leverages Transformer-based contextual embeddings and an enhanced autoencoder. First, semantically rich representations are extracted using a pretrained Transformer. Second, a dual-objective reconstruction loss—combining mean squared error and cosine similarity—is introduced to improve semantic fidelity. Third, a context-aware semantic refinement module is incorporated to optimize soft cluster assignments. Finally, a joint distribution alignment loss is employed to enhance the structural consistency of the clustering output. Evaluated on five benchmark datasets—including AG News and Yahoo! Answers—our method consistently outperforms state-of-the-art approaches. Notably, it achieves a new SOTA clustering accuracy of 53.63% on Yahoo! Answers, demonstrating superior capability in both semantic preservation and meaningful cluster structure discovery.
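The dual-objective reconstruction loss described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the weighting term `alpha` and the equal-weight combination are assumptions, as the paper's exact formulation is not given here.

```python
import numpy as np

def dual_reconstruction_loss(x, x_hat, alpha=0.5):
    """Illustrative combination of MSE and cosine-similarity loss.

    x, x_hat: (n_samples, dim) arrays of inputs and reconstructions.
    alpha is a hypothetical weighting between the two terms.
    """
    # Mean squared error over all elements.
    mse = np.mean((x - x_hat) ** 2)
    # Per-sample cosine similarity; the loss is 1 - similarity, so a
    # perfectly aligned reconstruction incurs zero penalty.
    num = np.sum(x * x_hat, axis=1)
    denom = np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1) + 1e-8
    csl = np.mean(1.0 - num / denom)
    return alpha * mse + (1.0 - alpha) * csl
```

The cosine term penalizes directional drift in the embedding space even when element-wise error is small, which is the "semantic fidelity" the summary attributes to the combined loss.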

📝 Abstract
The high-dimensional and semantically complex nature of textual big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, SDEC applies a semantic refinement stage that exploits the contextual richness of transformer embeddings to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods, achieving a clustering accuracy of 85.7% on AG News and setting a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.
Problem

Research questions and friction points this paper is trying to address.

Overcoming high-dimensional text clustering challenges
Improving semantic preservation in unsupervised clustering
Enhancing clustering accuracy with transformer embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combining autoencoder with transformer embeddings
Using MSE and Cosine Similarity Loss
Semantic refinement with soft cluster assignments
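The third innovation, soft cluster assignments refined by a distributional loss, can be sketched in the style of Deep Embedded Clustering (DEC). Assuming SDEC follows the standard DEC formulation (Student's t kernel for soft assignments, a sharpened target distribution, and a KL divergence objective) — the paper's exact refinement module may differ:

```python
import numpy as np

def soft_assignments(z, centroids, alpha=1.0):
    """Soft cluster assignment q_ij via a Student's t kernel (DEC-style)."""
    d2 = np.sum((z[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened auxiliary target p that emphasizes confident assignments."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    """KL(P || Q): the distributional loss aligning assignments with the target."""
    return np.sum(p * np.log(p / q))
```

Minimizing KL(P || Q) pulls embeddings toward high-confidence clusters, which is what the "joint distribution alignment loss" in the summary accomplishes structurally.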
Mohammad Wali Ur Rahman
Department of Electrical and Computer Engineering, University of Arizona, Tucson, AZ 85721, USA
Ric Nevarez
Trustweb, New York, NY 10001, USA
Lamia Tasnim Mim
Department of Computer Science, New Mexico State University, Las Cruces, NM 88003, USA
Salim Hariri
University of Arizona
autonomic computing · security · cloud computing