🤖 AI Summary
Selecting the best clustering algorithm for large-scale data remains challenging, particularly in semi-supervised settings where partial ground-truth labels are available only through costly oracle queries. Method: This paper introduces and formalizes a notion of “size generalization”: conditions under which an algorithm that is most accurate on a small subsample is also most accurate on the full dataset. The paper derives sufficient conditions for size generalization for single-linkage, k-means++, and a smoothed variant of Gonzalez's k-centers heuristic. The framework combines clustering stability analysis, sampling theory, and semi-supervised learning, pairing subsampling-based evaluation with oracle validation. Results: Empirical evaluation on real-world datasets shows that a subsample of as little as 5% of the data suffices to identify which algorithm is best on the full dataset. The core contribution is a provable and empirically validated theory of size generalization for clustering algorithms, enabling efficient algorithm selection without full-label supervision.
📝 Abstract
In clustering algorithm selection, we are given a massive dataset and must efficiently select which clustering algorithm to use. We study this problem in a semi-supervised setting, with an unknown ground-truth clustering that we can only access through expensive oracle queries. Ideally, the clustering algorithm's output will be structurally close to the ground truth. We approach this problem by introducing a notion of size generalization for clustering algorithm accuracy. We identify conditions under which we can (1) subsample the massive clustering instance, (2) evaluate a set of candidate algorithms on the smaller instance, and (3) guarantee that the algorithm with the best accuracy on the small instance will have the best accuracy on the original big instance. We provide theoretical size generalization guarantees for three classic clustering algorithms: single-linkage, k-means++, and (a smoothed variant of) Gonzalez's k-centers heuristic. We validate our theoretical analysis with empirical results, observing that on real-world clustering instances, we can use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
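The three-step procedure in the abstract (subsample, evaluate candidates on the small instance, trust the winner on the full instance) can be sketched in a few lines. The sketch below is illustrative only: the function names, the use of scikit-learn estimators as candidates, and the choice of adjusted Rand index as the accuracy measure against the oracle labels are assumptions, not the paper's exact formalism.

```python
# Hypothetical sketch of subsampling-based clustering algorithm selection.
# Assumptions (not from the paper): scikit-learn candidate estimators and
# adjusted Rand index (ARI) as the accuracy measure vs. the oracle labels.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

def select_algorithm(X, oracle_labels, candidates, sample_frac=0.05, seed=0):
    """Score each candidate on a small subsample; return (best name, scores).

    The oracle is queried only for the labels of the subsampled points,
    mimicking the expensive-oracle setting in the abstract.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    idx = rng.choice(n, size=max(2, int(sample_frac * n)), replace=False)
    X_sub, y_sub = X[idx], oracle_labels[idx]
    scores = {
        name: adjusted_rand_score(y_sub, algo.fit_predict(X_sub))
        for name, algo in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Toy "massive" instance: three well-separated Gaussian blobs.
X, y = make_blobs(n_samples=2000, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=0.5, random_state=1)
candidates = {
    "single-linkage": AgglomerativeClustering(n_clusters=3, linkage="single"),
    "k-means++": KMeans(n_clusters=3, init="k-means++", n_init=5,
                        random_state=1),
}
best, scores = select_algorithm(X, y, candidates, sample_frac=0.05)
```

Here only 5% of the points (100 of 2000) need oracle labels; the paper's contribution is characterizing when the winner on such a subsample is provably also the winner on the full instance.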