Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering

📅 2026-04-12

🤖 AI Summary
Existing tabular clustering methods rely predominantly on statistical co-occurrence and neglect the intrinsic semantics carried by feature names and values, so semantically similar instances are often separated. To address this, the paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a framework that integrates open-world semantic knowledge into tabular clustering. TagCC uses large language models to transform tabular data into semantic-aware textual anchors, aligns the statistical tabular representations with these anchors through contrastive learning, and jointly optimizes a clustering objective. Experiments on multiple benchmark datasets show that TagCC significantly outperforms competing methods.
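
To make the idea of a textual anchor concrete, here is a minimal sketch of how one table row might be serialized into text before it is handed to a language model. The template, the function names `row_to_text` and `build_anchor_prompt`, and the prompt wording are all assumptions for illustration; the summary does not specify how TagCC performs this transformation.

```python
def row_to_text(row: dict) -> str:
    """Serialize one table row as 'feature is value' clauses.

    Hypothetical template: missing values (None) are skipped so the
    text only mentions features that are actually observed.
    """
    clauses = [
        f"{name} is {value}"
        for name, value in row.items()
        if value is not None
    ]
    return "; ".join(clauses) + "."


def build_anchor_prompt(row: dict) -> str:
    """Wrap the serialized row in an instruction for an LLM.

    The instruction text is an assumption, not the paper's prompt.
    """
    return (
        "Describe the following record in one sentence, using the "
        "meaning of each feature name and value: " + row_to_text(row)
    )


# Example record with one missing value.
row = {"age": 34, "diagnosis": "Flu", "smoker": None}
print(row_to_text(row))
print(build_anchor_prompt(row))
```

Because feature names and values appear as words, a text encoder can place 'Flu' and 'Cold' near each other even when they never co-occur in the table.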

📝 Abstract
Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like 'Flu' and 'Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.
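
The joint optimization described above can be sketched as a contrastive term plus a clustering term. The code below is an illustrative assumption, not the paper's formulation: it uses an InfoNCE-style loss that matches each row's tabular embedding to its own text-anchor embedding, and a k-means-style distance to the nearest centroid as a stand-in clustering objective. The function names, the `temperature` and `weight` parameters, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Guard against a zero vector to avoid division by zero.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(tab, txt, temperature=0.5):
    """Mean cross-entropy of matching row i's tabular embedding
    to its own text anchor among all anchors in the batch."""
    tab = [normalize(v) for v in tab]
    txt = [normalize(v) for v in txt]
    total = 0.0
    for i, z in enumerate(tab):
        logits = [dot(z, a) / temperature for a in txt]
        peak = max(logits)  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(tab)


def clustering_loss(tab, centroids):
    """Mean squared distance from each embedding to its nearest
    centroid (a k-means-style stand-in, assumed for illustration)."""
    total = 0.0
    for z in tab:
        total += min(
            sum((a - c) ** 2 for a, c in zip(z, mu)) for mu in centroids
        )
    return total / len(tab)


def joint_loss(tab, txt, centroids, weight=1.0, temperature=0.5):
    """Contrastive alignment plus weighted clustering term."""
    return info_nce(tab, txt, temperature) + weight * clustering_loss(tab, centroids)


# Toy batch: two rows whose tabular embeddings agree with their anchors.
tab = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(tab, aligned) < info_nce(tab, swapped))
```

In this toy batch the contrastive term is lower when each row is paired with its own anchor than when the anchors are swapped, which is the behavior the alignment step relies on. In practice both terms would be minimized by gradient descent over an encoder's parameters rather than evaluated on fixed vectors.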

Problem

Research questions and friction points this paper is trying to address.

tabular data clustering
semantic knowledge
statistical co-occurrence
deep clustering
feature semantics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Clustering
Semantic-aware Transformation
Contrastive Learning
Large Language Models
Tabular Data