Geometric Median Matching for Robust k-Subset Selection from Noisy Data

📅 2025-04-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address robust data pruning in high-noise, large-scale settings, this paper proposes Geometric Median (GM) Matching, a k-subset selection method that iteratively builds a subset whose mean approximates the GM of the full dataset, thereby avoiding the outlier sensitivity of conventional empirical mean estimation. The work is presented as the first to bring the geometric median into data pruning. Because the GM is a robust estimator with an optimal breakdown point of 1/2, the selection target remains stable as long as fewer than half of the points are arbitrarily corrupted. The authors prove an O(1/k) convergence rate, a quadratic improvement over random sampling, that holds even under arbitrary corruption. Empirically, on image classification and image generation tasks, the method maintains stable performance under extreme conditions (>30% label noise and >90% pruning ratio) and consistently outperforms existing pruning approaches, making it a strong baseline for robust data pruning.

📝 Abstract

Data pruning, the combinatorial task of selecting a small and representative subset from a large dataset, is crucial for mitigating the enormous computational cost of training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust in the presence of corruption is critical in practice. However, existing data pruning methods often fail under high corruption rates because they rely on empirical mean estimation, which is highly sensitive to outliers. In response, we propose Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the Geometric Median, a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees showing that GM Matching enjoys an improved O(1/k) convergence rate, a quadratic improvement over random sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings and at high pruning rates, making it a strong baseline for robust data pruning.
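
The quantities named in the abstract can be written out as follows. This is only a notational sketch: the symbols (data points x_i, dataset size n, selected subset S, Euclidean norm) are assumptions made here, not the paper's own notation, and the O(1/√k) rate for random sampling is the standard Monte Carlo rate that "quadratic improvement" implies.

```latex
% Geometric median of the (possibly corrupted) dataset {x_1, ..., x_n}
\mu^{\mathrm{GM}} \;=\; \arg\min_{z \in \mathbb{R}^d} \sum_{i=1}^{n} \lVert z - x_i \rVert_2

% Matching objective: the mean of the selected k-subset S should approximate the GM
\left\lVert \frac{1}{k} \sum_{x \in S} x \;-\; \mu^{\mathrm{GM}} \right\rVert_2
  \;=\; O\!\left(\frac{1}{k}\right)
  \qquad \text{versus} \qquad
  O\!\left(\frac{1}{\sqrt{k}}\right) \ \text{for a uniformly random } k\text{-subset}
```

A breakdown point of 1/2 means the estimator stays bounded as long as fewer than half of the x_i are replaced by arbitrary values, whereas a single arbitrary point can move the empirical mean without bound.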

Problem

Research questions and friction points this paper is trying to address.

Robust k-subset selection from noisy data

Enhancing resilience against high corruption rates

Improving convergence rate for data pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Geometric Median for robust subset selection

Iteratively matches subset mean to Geometric Median

Achieves O(1/k) convergence rate under corruption
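
The three points above can be illustrated with a small sketch. It is not the authors' implementation: the Weiszfeld iteration is a standard way to approximate a geometric median (the source does not say which solver the paper uses), and the greedy rule below, which adds the point that brings the running subset mean closest to the GM, is an assumed stand-in for the paper's iterative matching step. All function and variable names are hypothetical.

```python
import numpy as np


def geometric_median(points, n_iter=200, tol=1e-8):
    """Approximate the geometric median of the rows of `points` (Weiszfeld iterations)."""
    z = points.mean(axis=0)  # start from the (non-robust) empirical mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - z, axis=1)
        w = 1.0 / np.maximum(dist, 1e-12)  # inverse-distance weights, guarded against /0
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z


def gm_matching(points, k):
    """Greedily pick k distinct rows whose mean tracks the geometric median.

    Assumed greedy rule (illustrative only): at step t, add the unused point
    that minimizes the distance between the new subset mean and the GM.
    """
    target = geometric_median(points)
    available = np.ones(len(points), dtype=bool)
    running_sum = np.zeros(points.shape[1])
    chosen = []
    for t in range(1, k + 1):
        # Subset mean that would result from adding each candidate point.
        cand_means = (running_sum + points) / t
        err = np.linalg.norm(cand_means - target, axis=1)
        err[~available] = np.inf  # never pick the same point twice
        i = int(np.argmin(err))
        chosen.append(i)
        available[i] = False
        running_sum += points[i]
    return np.array(chosen), target


# Toy data: 700 clean points around the origin plus 300 gross outliers (30% corruption).
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(700, 2))
outliers = rng.normal(50.0, 1.0, size=(300, 2))
data = np.vstack([clean, outliers])

idx, gm = gm_matching(data, k=50)
subset_mean = data[idx].mean(axis=0)
print("empirical mean:", data.mean(axis=0))
print("geometric median:", gm)
print("subset mean:", subset_mean)
```

On this toy data the empirical mean is dragged far toward the outlier cluster, while the geometric median and the mean of the selected subset stay near the clean cluster. The sketch costs O(nk) distance evaluations on top of the GM computation and works on raw vectors; any embedding or feature space used in the paper is outside what this example assumes.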

🔎 Similar Papers

Geometric Median (GM) Matching for Robust Data Pruning