🤖 AI Summary
This work addresses the problem of automatically determining the optimal number of clusters for complex, high-dimensional data, such as large-scale image collections, without any prior information. The proposed method, CNMBI, avoids assumptions about the data distribution and does not require complete clustering results. It reformulates cluster number estimation as a dynamic comparison of the positional relationships among cluster centers, and models that comparison with a bipartite graph and a pairwise center-matching strategy. It also introduces a sample confidence filtering mechanism that excludes low-confidence boundary samples, which the authors state is the first use of such filtering in cluster number determination. Experiments on challenging benchmarks, including CIFAR-10 and STL-10, show that the method outperforms current state-of-the-art techniques and is more robust and adaptable.
📝 Abstract
One of the main challenges in data mining is choosing the optimal number of clusters without prior information. Existing methods usually follow the philosophy of cluster validation and therefore rest on underlying assumptions about the data distribution, which prevents their application to complex real-world data such as large-scale images and high-dimensional data. To address this, we propose an approach named CNMBI. Leveraging the distribution information inherent in the data space, we cast the task as a dynamic comparison of the positional behavior of cluster centers, which, unlike earlier methods, requires neither complete clustering results nor a complex hand-designed validity index. Bipartite graph theory is then employed to model this process efficiently. Additionally, we find that different samples have different confidence levels and therefore actively remove low-confidence ones, a step that, to our knowledge, is considered here for the first time in cluster number determination. CNMBI is robust and allows for more flexibility in the dimension and shape of the target data (e.g., CIFAR-10 and STL-10). Extensive comparison studies with state-of-the-art competitors on various challenging datasets demonstrate the superiority of our method.
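The abstract does not spell out how confidence is scored or how centers are compared, so the following is only a minimal illustrative sketch of the two ideas it names, not the CNMBI algorithm itself. It assumes a hypothetical confidence score (one minus the ratio of a sample's distance to its nearest center over its distance to the second-nearest center) and treats center comparison as a minimum-cost one-to-one matching on a complete bipartite graph between two center sets. The function names, the score, and the threshold are all assumptions introduced for illustration.

```python
from itertools import permutations
from math import dist


def confidence(sample, centers):
    """Hypothetical score (an assumption, not the paper's definition).

    Returns 1 - d_nearest / d_second_nearest. Samples halfway between two
    centers score near 0; samples close to a single center score near 1.
    """
    distances = sorted(dist(sample, c) for c in centers)
    if distances[1] == 0:
        return 0.0
    return 1.0 - distances[0] / distances[1]


def filter_low_confidence(samples, centers, threshold=0.2):
    """Keep only samples whose hypothetical confidence meets the threshold."""
    return [s for s in samples if confidence(s, centers) >= threshold]


def match_centers(centers_a, centers_b):
    """Minimum-cost one-to-one matching between two equally sized center sets.

    Each center in centers_a is a left node, each center in centers_b a right
    node, and edge weights are Euclidean distances. Brute force over all
    permutations, so this is only suitable for a handful of centers.
    """
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(centers_b))):
        cost = sum(dist(centers_a[i], centers_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


if __name__ == "__main__":
    centers = [(0.0, 0.0), (10.0, 0.0)]
    samples = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
    # The sample at (5, 0) is equidistant from both centers and is dropped.
    print(filter_low_confidence(samples, centers))

    other_centers = [(9.5, 0.0), (0.5, 0.0)]
    pairs, cost = match_centers(centers, other_centers)
    print(pairs, cost)
```

For more than a few centers, the brute-force search would normally be replaced by an assignment solver such as `scipy.optimize.linear_sum_assignment`. How the matched center positions are then turned into a decision about the number of clusters is not described in the abstract and is left out here.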