🤖 AI Summary
Existing categorical clustering methods suffer from two key limitations: (i) categories lack prior semantic relationships, and (ii) fixed distance metrics adapt poorly to diverse cluster structures. To address these, this paper proposes a cluster-adaptive, learnable distance metric. Its core contributions are threefold: (i) the first formulation of *cluster-specific* categorical relationship modeling, which eliminates reliance on predefined topological structures; (ii) a differentiable distance framework that jointly optimizes categorical relationships and clustering objectives via end-to-end learning; and (iii) native compatibility with Euclidean distance, enabling seamless extension to mixed-type data. Experiments on 12 real-world categorical datasets show state-of-the-art performance: the method achieves a mean clustering accuracy rank of 1.25 (lower is better), compared with 5.21 for the current best approach, which supports its capacity to model complex distributions and generalize across heterogeneous data.
📝 Abstract
Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike numerical attributes, whose values are naturally compared by Euclidean distance, categorical attributes lack well-defined relationships among their possible values (also called categories), which hampers the discovery of compact categorical data clusters. Although many methods have been proposed to develop appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning the metric, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper therefore removes the fixed intrinsic relationships among attribute categories and learns customized distance metrics that can flexibly and accurately reveal various cluster distributions. As a result, the learnable category relationships significantly enhance the fitting ability of the clustering algorithm. Moreover, the learned category relationships are proven to be compatible with the Euclidean distance metric, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets, with significance tests, show the superior clustering accuracy of the proposed method, which attains an average ranking of 1.25, significantly better than the 5.21 average ranking of the current best-performing method.
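
To make the idea of Euclidean-compatible category distances concrete, the following is a minimal sketch and not the paper's algorithm. It assumes that each categorical attribute has a symmetric table of pairwise category distances and that these combine with numerical differences as a root of summed squares. The helper names `make_category_distance` and `mixed_distance`, the combining rule, and the distance values are all hypothetical; in the paper the category relationships are learned during clustering rather than set by hand.

```python
import math


def make_category_distance(pairs):
    """Build a symmetric category-distance lookup from {(a, b): d}.

    Hypothetical helper: the distances here are hand-set for illustration,
    whereas the paper learns them.
    """
    table = {}
    for (a, b), d in pairs.items():
        table[(a, b)] = d
        table[(b, a)] = d  # enforce symmetry

    def dist(a, b):
        # Identical categories are at distance zero.
        return 0.0 if a == b else table[(a, b)]

    return dist


def mixed_distance(x, y, cat_dists, num_idx):
    """Euclidean-style distance over mixed attributes (assumed combining rule).

    x, y      -- sequences of attribute values
    cat_dists -- {attribute index: category distance function}
    num_idx   -- indices of the numerical attributes
    """
    total = 0.0
    for i, dist in cat_dists.items():
        total += dist(x[i], y[i]) ** 2
    for i in num_idx:
        total += (x[i] - y[i]) ** 2
    return math.sqrt(total)


# Made-up distances for a single "color" attribute.
color = make_category_distance({
    ("red", "orange"): 0.3,
    ("red", "blue"): 1.0,
    ("orange", "blue"): 0.9,
})

a = ("red", 1.0)
b = ("orange", 1.4)
c = ("blue", 1.0)

d_ab = mixed_distance(a, b, {0: color}, [1])  # close categories, small numeric gap
d_ac = mixed_distance(a, c, {0: color}, [1])  # distant categories, same numeric value
```

Under these assumptions, object `a` is closer to `b` than to `c` even though `a` and `c` share the same numerical value, because the category distance contributes to the total on the same footing as a numerical difference. A cluster-adaptive variant would keep one such table per cluster.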