HyperCore: Coreset Selection under Noise via Hypersphere Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing core-set selection methods neglect label noise and rely on predefined pruning ratios, limiting their practicality. This paper proposes a noise-aware hyperspherical core-set selection framework: for each class, it constructs a lightweight hyperspherical model to embed samples and identify anomalies based on their distance to the intra-class centroid. It further applies Youden's J statistic to adaptively determine the pruning threshold, eliminating the need for manual hyperparameter tuning while automatically filtering mislabeled and ambiguous samples, even under low-data and high-noise conditions. Experiments demonstrate that the proposed method significantly outperforms state-of-the-art approaches across diverse noise settings, yielding more compact and information-rich subsets. These subsets effectively support robust and scalable downstream learning tasks.
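The per-class anomaly scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simple hypersphere whose center is the mean of the class embeddings, and scores each sample by its Euclidean distance to that center (the paper's hyperspherical models are learned, lightweight modules).

```python
import numpy as np

def centroid_distance_scores(embeddings: np.ndarray) -> np.ndarray:
    """Score each sample of one class by its distance to the class centroid.

    `embeddings` is an (n_samples, dim) array of feature vectors for a
    single class; larger scores suggest mislabeled or ambiguous samples.
    """
    center = embeddings.mean(axis=0)              # hypersphere center (assumed: class mean)
    return np.linalg.norm(embeddings - center, axis=1)

# Toy example: a tight in-class cluster plus one far-away outlier.
X = np.array([[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0], [5.0, 5.0]])
scores = centroid_distance_scores(X)
# The outlier at index 3 receives the largest distance score.
```

Samples whose scores exceed an adaptively chosen threshold would then be pruned from the coreset.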

📝 Abstract
The goal of coreset selection methods is to identify representative subsets of datasets for efficient model training. Yet, existing methods often ignore the possibility of annotation errors and require fixed pruning ratios, making them impractical in real-world settings. We present HyperCore, a robust and adaptive coreset selection framework designed explicitly for noisy environments. HyperCore leverages lightweight hypersphere models learned per class, embedding in-class samples close to a hypersphere center while naturally segregating out-of-class samples based on their distance. By using Youden's J statistic, HyperCore can adaptively select pruning thresholds, enabling automatic, noise-aware data pruning without hyperparameter tuning. Our experiments reveal that HyperCore consistently surpasses state-of-the-art coreset selection methods, especially under noisy and low-data regimes. HyperCore effectively discards mislabeled and ambiguous points, yielding compact yet highly informative subsets suitable for scalable and noise-free learning.
Problem

Research questions and friction points this paper is trying to address.

Selects representative data subsets despite annotation errors
Adaptively determines pruning thresholds without manual tuning
Identifies and removes mislabeled samples for robust learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses hypersphere models for class representation
Adaptively selects thresholds via Youden's statistic
Automatically prunes noisy data without hyperparameter tuning
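The adaptive thresholding idea above can be sketched with Youden's J statistic (J = sensitivity + specificity − 1): scan candidate cutoffs over the anomaly scores and keep the one maximizing J. This is a hedged sketch under an assumption the page does not spell out, namely that binary in-class/out-of-class indicators (here `is_noisy`) are available to compute true- and false-positive rates; how HyperCore constructs these is described in the paper itself.

```python
import numpy as np

def youden_threshold(scores: np.ndarray, is_noisy: np.ndarray) -> float:
    """Pick the cutoff that maximizes Youden's J = TPR - FPR.

    `scores`: anomaly scores (higher = more likely noisy).
    `is_noisy`: assumed binary indicators (1 = out-of-class / noisy).
    """
    best_j, best_t = -1.0, float(scores.max())
    for t in np.unique(scores):                      # candidate thresholds
        pred = scores >= t                           # flagged as noisy
        tpr = (pred & (is_noisy == 1)).sum() / max((is_noisy == 1).sum(), 1)
        fpr = (pred & (is_noisy == 0)).sum() / max((is_noisy == 0).sum(), 1)
        if tpr - fpr > best_j:
            best_j, best_t = tpr - fpr, float(t)
    return best_t

# Toy example: three low-scoring clean samples, two high-scoring noisy ones.
scores = np.array([0.1, 0.2, 0.3, 2.5, 3.0])
labels = np.array([0, 0, 0, 1, 1])
t = youden_threshold(scores, labels)  # lands between the two groups
```

Because the threshold is derived from the score distribution itself, no pruning ratio or hyperparameter needs to be fixed in advance, which is the key practical claim of the paper.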