SubZeroCore: A Submodular Approach with Zero Training for Coreset Selection

📅 2025-09-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing coreset selection methods rely on training signals computed over the full dataset, such as gradients or forgetting counts, which contradicts their fundamental goal of reducing training overhead. Method: The authors propose SubZeroCore, a training-free coreset selection method that unifies submodular coverage and density in a single objective. Its sampling strategy is based on a closed-form solution and is governed by a single hyperparameter that controls the desired coverage for local density measures, so representative samples are selected without any model training or gradient computation. Contribution/Results: Experiments show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, reduces computational cost by 1–2 orders of magnitude, is more robust to label noise, and scales to large datasets.

📝 Abstract

The goal of coreset selection is to identify representative subsets of datasets for efficient model training. Yet, existing approaches paradoxically require expensive training-based signals, e.g., gradients, decision boundary estimates or forgetting counts, computed over the entire dataset prior to pruning, which undermines their very purpose by requiring training on samples they aim to avoid. We introduce SubZeroCore, a novel, training-free coreset selection method that integrates submodular coverage and density into a single, unified objective. To achieve this, we introduce a sampling strategy based on a closed-form solution to optimally balance these objectives, guided by a single hyperparameter that explicitly controls the desired coverage for local density measures. Despite no training, extensive evaluations show that SubZeroCore matches training-based baselines and significantly outperforms them at high pruning rates, while dramatically reducing computational overhead. SubZeroCore also demonstrates superior robustness to label noise, highlighting its practical effectiveness and scalability for real-world scenarios.
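To make the idea of combining submodular coverage with density concrete, here is a minimal Python sketch. It is not the paper's method: the abstract does not spell out the closed-form sampling strategy, so this example swaps in a standard greedy maximization of a facility-location objective, weighted by a k-nearest-neighbour density estimate. The function names (`knn_density`, `greedy_weighted_facility_location`), the choice of inverse mean kNN distance as the density, and the parameter `k` are all assumptions made for illustration. The only property it shares with the described approach is that it needs features alone, with no model training or gradients.

```python
import numpy as np


def knn_density(dist, k):
    """Hypothetical density score: inverse mean distance to the k nearest neighbours."""
    # Sort each row; column 0 is the point itself (distance 0), so skip it.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    return 1.0 / (knn.mean(axis=1) + 1e-12)


def greedy_weighted_facility_location(X, budget, k=5):
    """Greedily maximise f(S) = sum_i w_i * max_{j in S} sim(i, j).

    With non-negative weights w_i and similarities, f is monotone submodular,
    so the greedy choice is the standard approximation strategy.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sim = dist.max() - dist          # non-negative similarity
    w = knn_density(dist, k)
    w = w / w.sum()                  # normalised density weights (assumption)

    best = np.zeros(n)               # best[i] = max similarity of point i to the selected set
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # gains[j] = weighted coverage improvement from adding candidate j.
        gains = (w[:, None] * np.maximum(sim - best[:, None], 0.0)).sum(axis=0)
        gains[chosen] = -np.inf      # never pick the same point twice
        j = int(np.argmax(gains))
        selected.append(j)
        chosen[j] = True
        best = np.maximum(best, sim[:, j])
    return selected


# Usage: two well-separated clusters of 20 points each, select 2 points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(20, 2)),
])
sel = greedy_weighted_facility_location(X, budget=2, k=5)
print(sel)
```

In this toy setup the second pick lands in the cluster the first pick did not cover, because covering an untouched dense region yields a far larger marginal gain than refining one that is already represented. The dense pairwise distance matrix costs O(n²) memory, so this sketch is only suitable for small feature sets.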

Problem

Research questions and friction points this paper is trying to address.

Selecting representative data subsets without any model training

Balancing submodular coverage and density objectives

Reducing computational cost while maintaining performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free coreset selection using submodular objectives

Balances coverage and density with a closed-form solution

Single hyperparameter controls local density coverage
