Categorical data clustering: 25 years beyond K-modes

📅 2024-08-30
🏛️ arXiv.org
🤖 AI Summary
This paper addresses longstanding challenges in categorical data clustering—particularly low accuracy, poor robustness to noise, and limited scalability—focusing on nominal and ordinal data. We systematically survey 25 years of research on K-modes and its variants, identifying key methodological bottlenecks. For the first time, we empirically benchmark 12 open-source state-of-the-art algorithms—including K-modes extensions, information-theoretic models, graph neural networks, probabilistic generative models, and ensemble methods—across standard benchmark datasets, evaluating clustering quality, noise robustness, and computational scalability. Our analysis delineates performance boundaries and domain-specific applicability, uncovering persistent challenges: modeling bias, heterogeneous feature coupling, and poor generalization under small-sample conditions. Based on these findings, we propose a domain-aware algorithm selection guideline tailored to healthcare, social sciences, and engineering applications. This work establishes the most comprehensive empirical benchmark and methodological framework for categorical data clustering to date.

📝 Abstract
The clustering of categorical data is a common and important task in computer science, offering profound implications across a spectrum of applications. Unlike purely numerical data, categorical data often lack inherent ordering as in nominal data, or have varying levels of order as in ordinal data, thus requiring specialized methodologies for efficient organization and analysis. This review provides a comprehensive synthesis of categorical data clustering in the past twenty-five years, starting from the introduction of K-modes. It elucidates the pivotal role of categorical data clustering in diverse fields such as health sciences, natural sciences, social sciences, education, engineering and economics. Practical comparisons are conducted for algorithms having public implementations, highlighting distinguishing clustering methodologies and revealing the performance of recent algorithms on several benchmark categorical datasets. Finally, challenges and opportunities in the field are discussed.
Problem

Research questions and friction points this paper is trying to address.

Clustering Analysis
Categorical Data
K-modes Optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Categorical Data Clustering
Algorithm Performance Evaluation
Future Prospects
Tai Dinh
The Kyoto College of Graduate Studies for Informatics, 7 Tanaka Monzencho, Sakyo Ward, Kyoto City, Kyoto, Japan
Wong Hauchi
The Kyoto College of Graduate Studies for Informatics, 7 Tanaka Monzencho, Sakyo Ward, Kyoto City, Kyoto, Japan
Philippe Fournier-Viger
Distinguished professor, Shenzhen University, China
Data Mining, Artificial Intelligence, Big Data, Pattern Mining, Complex data
D. Lisik
University of Gothenburg, Medicinaregatan 1F, 413 90, Göteborg, Sweden
Minh-Quyet Ha
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan
H. Dam
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan
Van-Nam Huynh
Japan Advanced Institute of Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan