HyperCore: Coreset Selection under Noise via Hypersphere Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing core-set selection methods neglect label noise and rely on predefined pruning ratios, limiting their practicality. This paper proposes a noise-aware hyperspherical core-set selection framework: for each class, it constructs a lightweight hyperspherical model to embed samples and identify anomalies based on their distance to the intra-class centroid. It further applies Youden's J statistic to adaptively determine the pruning threshold, eliminating the need for manual hyperparameter tuning while automatically filtering mislabeled and ambiguous samples, even under low-data and high-noise conditions. Experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse noise settings, yielding more compact and information-rich subsets. These subsets effectively support robust and scalable downstream learning tasks.
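The per-class anomaly scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple hypersphere whose center is the mean of the class embeddings, and scores each sample by its Euclidean distance to that center (the paper's hyperspherical models are learned, lightweight modules).

```python
import numpy as np

def centroid_distance_scores(embeddings: np.ndarray) -> np.ndarray:
    """Score each sample of one class by its distance to the class centroid.

    `embeddings` is an (n_samples, dim) array of feature vectors for a
    single class; larger scores suggest mislabeled or ambiguous samples.
    """
    center = embeddings.mean(axis=0)              # hypersphere center (assumed: class mean)
    return np.linalg.norm(embeddings - center, axis=1)

# Toy example: a tight in-class cluster plus one far-away outlier.
X = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [5.0, 5.0]])
scores = centroid_distance_scores(X)
# The outlier at index 3 receives the largest distance score.
```

Samples whose scores exceed an adaptively chosen threshold would then be pruned from the coreset.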

📝 Abstract
The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
Problem

Research questions and friction points this paper is trying to address.

Selects representative data subsets despite annotation errors
Adaptively determines pruning thresholds without manual tuning
Identifies and removes mislabeled samples for robust learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses hypersphere models for class representation
Adaptively selects thresholds via Youden's statistic
Automatically prunes noisy data without hyperparameter tuning
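The adaptive thresholding idea above can be sketched with Youden's J statistic (J = sensitivity + specificity − 1): scan candidate cutoffs over the anomaly scores and keep the one maximizing J. This is a hedged sketch under an assumption the page does not spell out, namely that binary in-class/out-of-class indicators (here `is_noisy`) are available to compute true- and false-positive rates; how HyperCore constructs these is described in the paper itself.

```python
import numpy as np

def youden_threshold(scores: np.ndarray, is_noisy: np.ndarray) -> float:
    """Pick the cutoff that maximizes Youden's J = TPR - FPR.

    `scores`: anomaly scores (higher = more likely noisy).
    `is_noisy`: assumed binary indicators (1 = out-of-class / noisy).
    """
    best_j, best_t = -1.0, float(scores.max())
    for t in np.unique(scores):                      # candidate thresholds
        pred = scores >= t                           # flagged as noisy
        tpr = (pred & (is_noisy == 1)).sum() / max((is_noisy == 1).sum(), 1)
        fpr = (pred & (is_noisy == 0)).sum() / max((is_noisy == 0).sum(), 1)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, float(t)
    return best_t

# Toy example: three low-scoring clean samples, two high-scoring noisy ones.
scores = np.array([0.1, 0.2, 0.3, 2.5, 3.0])
labels = np.array([0, 0, 0, 1, 1])
t = youden_threshold(scores, labels)  # lands between the two groups
```

Because the threshold is derived from the score distribution itself, no pruning ratio or hyperparameter needs to be fixed in advance, which is the key practical claim of the paper.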