🤖 AI Summary
Selecting the best clustering algorithm for large-scale data remains challenging, particularly in semi-supervised settings where partial ground-truth labels are available only through costly oracle queries. Method: This paper introduces and formalizes a notion of “size generalization”: conditions under which an algorithm that is most accurate on a small subsample is also most accurate on the full dataset. The paper derives sufficient conditions for size generalization for single-linkage, k-means++, and a smoothed variant of Gonzalez's k-centers heuristic. The framework combines clustering stability analysis, sampling theory, and semi-supervised learning, pairing subsampling-based evaluation with oracle validation. Results: Empirical evaluation on real-world datasets shows that a subsample of as little as 5% of the data suffices to identify which algorithm is best on the full dataset. The core contribution is a provable and empirically validated theory of size generalization for clustering algorithms, enabling efficient algorithm selection without full-label supervision.
📝 Abstract
In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
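The three-step procedure in the abstract (subsample, evaluate candidates on the small instance, trust the winner on the full instance) can be sketched in a few lines. The sketch below is illustrative only: the function names, the use of scikit-learn estimators as candidates, and the choice of adjusted Rand index as the accuracy measure against the oracle labels are assumptions, not the paper's exact formalism.

```python
# Hypothetical sketch of subsampling-based clustering algorithm selection.
# Assumptions (not from the paper): scikit-learn candidate estimators and
# adjusted Rand index (ARI) as the accuracy measure vs. the oracle labels.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def select_algorithm(X, oracle_labels, candidates, sample_frac=0.05, seed=0):
    """Score each candidate on a small subsample; return (best name, scores).

    The oracle is queried only for the labels of the subsampled points,
    mimicking the expensive-oracle setting in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=max(2, int(sample_frac * n)), replace=False)
    X_sub, y_sub = X[idx], oracle_labels[idx]
    scores = {
        name: adjusted_rand_score(y_sub, algo.fit_predict(X_sub))
        for name, algo in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Toy "massive" instance: three well-separated Gaussian blobs.
X, y = make_blobs(n_samples=2000, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=0.5, random_state=1)
candidates = {
    "single-linkage": AgglomerativeClustering(n_clusters=3, linkage="single"),
    "k-means++": KMeans(n_clusters=3, init="k-means++", n_init=5,
                        random_state=1),
}
best, scores = select_algorithm(X, y, candidates, sample_frac=0.05)
```

Here only 5% of the points (100 of 2000) need oracle labels; the paper's contribution is characterizing when the winner on such a subsample is provably also the winner on the full instance.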