🤖 AI Summary
This work addresses a limitation of existing tabular clustering methods: they rely predominantly on statistical co-occurrence and neglect the semantics carried by feature names and values, so semantically similar instances are often separated. To overcome this, the paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a framework that integrates open-world semantic knowledge into tabular clustering. TagCC uses large language models to transform tabular data into semantic-aware textual anchors, fuses these anchors with statistical tabular representations through contrastive learning, and jointly optimizes a clustering objective. Experiments on benchmark datasets show that TagCC significantly outperforms existing methods.
📝 Abstract
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as unrelated symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms competing methods.
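The abstract does not give TagCC's exact formulation, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes that a row is turned into text with a simple template (a stand-in for the LLM-based semantic-aware transformation), that the contrastive term is a symmetric InfoNCE loss pairing each row's tabular embedding with its own textual anchor embedding, and that the clustering objective is a k-means-style distance to the nearest centroid. All function names, the `temperature` and `weight` parameters, and the choice of losses are hypothetical and are not taken from the paper.

```python
import math


def serialize_row(row):
    """Template-based stand-in for the LLM transformation (assumption):
    keep feature names and values together so their meaning survives."""
    return "; ".join(f"{name} is {value}" for name, value in row.items()) + "."


def _normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(tab_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE (assumed loss): tab_emb[i] should match text_emb[i]
    and be dissimilar to every other anchor in the batch."""
    tab = [_normalize(v) for v in tab_emb]
    txt = [_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(t, x)) / temperature for x in txt]
        for t in tab
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-table direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


def kmeans_loss(emb, centroids):
    """Placeholder clustering objective (assumption): mean squared distance
    from each embedding to its nearest centroid."""
    total = 0.0
    for v in emb:
        total += min(sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids)
    return total / len(emb)


def joint_loss(contrastive, clustering, weight=1.0):
    """Joint objective: contrastive term plus a weighted clustering term."""
    return contrastive + weight * clustering


if __name__ == "__main__":
    print(serialize_row({"symptom": "Flu", "age": 34}))
    tab = [[1.0, 0.0], [0.0, 1.0]]       # toy tabular embeddings
    anchors = [[0.9, 0.1], [0.2, 0.8]]   # toy textual anchor embeddings
    loss = joint_loss(info_nce(tab, anchors), kmeans_loss(tab, tab), weight=0.5)
    print(loss)
```

In a real pipeline the toy vectors would come from a trainable tabular encoder and a text encoder applied to the anchors, and the combined loss would be minimized by gradient descent. The sketch only shows how the two terms pull in the same direction: the contrastive term moves each row toward its own anchor, while the clustering term keeps the resulting space compact around centroids.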