🤖 AI Summary
Addressing the challenge of balancing efficiency and accuracy in large-scale data clustering, this paper proposes the Granular Ball Skeleton Clustering (GBSK) framework, integrating granular ball computing with multi-sampling ensemble learning. GBSK constructs a multi-granularity granular ball abstraction of the data space and fuses clustering results from multiple random samplings to extract a robust statistical skeleton approximating the underlying data distribution. An adaptive variant, AGBSK, is further introduced to enable automatic parameter optimization and simplified deployment. Leveraging an efficient nearest-neighbor propagation strategy, the algorithm scales to datasets containing up to 10⁸ samples and 256 dimensions on standard hardware. Compared to state-of-the-art methods, GBSK achieves substantial reductions in both time and memory overhead while maintaining high clustering accuracy. The implementation is publicly available.
📝 Abstract
To handle clustering tasks on large-scale datasets effectively, we propose a novel scalable skeleton clustering algorithm, GBSK, which leverages the granular-ball technique to capture the underlying structure of the data. By multi-sampling the dataset and constructing multi-grained granular-balls, GBSK progressively uncovers a statistical "skeleton", a spatial abstraction that approximates the essential structure and distribution of the original data. This strategy enables GBSK to dramatically reduce computational overhead while maintaining high clustering accuracy. In addition, we introduce an adaptive version, AGBSK, with simplified parameter settings to enhance usability and facilitate deployment in real-world scenarios. Extensive experiments conducted on standard computing hardware demonstrate that GBSK achieves high efficiency and strong clustering performance on large-scale datasets, including one with up to 100 million instances across 256 dimensions. Our implementation and experimental results are available at: https://github.com/XFastDataLab/GBSK/.
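
The multi-sampling and granular-ball idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`split_ball`, `granular_balls`, `skeleton`), the 2-means splitting rule, the size-based stopping criterion, and all parameter values are assumptions chosen for illustration. The sketch draws several random samples, recursively splits each sample into small balls, and collects the ball centers as an approximate skeleton.

```python
import numpy as np

def split_ball(points, rng, n_iter=10):
    """Split one ball into (at most) two with a few 2-means iterations (assumed rule)."""
    idx = rng.choice(len(points), size=2, replace=False)
    centers = points[idx].copy()
    for _ in range(n_iter):
        # Distance from every point to both centers, then nearest-center assignment.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return [points[labels == k] for k in range(2) if np.any(labels == k)]

def granular_balls(points, rng, max_size=32):
    """Recursively split until every ball holds at most max_size points (assumed criterion)."""
    stack, balls = [points], []
    while stack:
        p = stack.pop()
        if len(p) <= max_size:
            balls.append(p)
            continue
        parts = split_ball(p, rng)
        if len(parts) < 2:
            # Degenerate split (e.g. duplicate points): keep the ball as it is.
            balls.append(p)
        else:
            stack.extend(parts)
    return balls

def skeleton(data, n_rounds=5, sample_size=1000, max_size=32, seed=0):
    """Pool ball centers from several random samples into one skeleton point set."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centers = []
    for _ in range(n_rounds):
        m = min(sample_size, len(data))
        sample = data[rng.choice(len(data), size=m, replace=False)]
        for ball in granular_balls(sample, rng, max_size):
            centers.append(ball.mean(axis=0))
    return np.vstack(centers)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Two synthetic Gaussian blobs as toy data.
    data = np.vstack([rng.normal(0.0, 1.0, (5000, 2)), rng.normal(8.0, 1.0, (5000, 2))])
    sk = skeleton(data, n_rounds=3, sample_size=500)
    print(sk.shape)
```

Under these assumptions, the skeleton is far smaller than the original dataset, so a conventional clustering method could be run on the skeleton points and the resulting labels propagated back to the full data.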