Coresets from Trajectories: Selecting Data via Correlation of Loss Differences

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address scalability bottlenecks of deep learning in real-time and resource-constrained settings, this paper proposes the Correlation of Loss Differences (CLD) criterion for efficiently selecting the most influential training samples to construct high-quality coresets. CLD measures how each sample's loss changes across training checkpoints correlate with the loss trajectory of a held-out validation set, requiring neither gradients nor second-order information, and comes with theoretical convergence guarantees. It supports cross-model transferability, stable coreset selection from early training checkpoints, and automatic class balancing via per-class validation alignment, reducing selection bias without additional stratified sampling. Experiments on CIFAR-100 and ImageNet-1K show that CLD-derived coresets match or surpass state-of-the-art methods in accuracy, with average performance gaps under 1% relative to high-cost baselines and cross-architecture transfer degradation below 1%. CLD thus offers a favorable trade-off among computational efficiency, selection stability, and generalization across diverse model architectures.

📝 Abstract
Deep learning models achieve state-of-the-art performance across domains but face scalability challenges in real-time or resource-constrained scenarios. To address this, we propose Correlation of Loss Differences (CLD), a simple and scalable metric for coreset selection that identifies the most impactful training samples by measuring their alignment with the loss trajectories of a held-out validation set. CLD is highly efficient, requiring only per-sample loss values computed at training checkpoints, and avoiding the costly gradient and curvature computations used in many existing subset selection methods. We develop a general theoretical framework that establishes convergence guarantees for CLD-based coresets, demonstrating that the convergence error is upper-bounded by the alignment of the selected samples and the representativeness of the validation set. On CIFAR-100 and ImageNet-1k, CLD-based coresets typically outperform or closely match state-of-the-art methods across subset sizes, and remain within 1% of more computationally expensive baselines even when not leading. CLD transfers effectively across architectures (ResNet, VGG, DenseNet), enabling proxy-to-target selection with <1% degradation. Moreover, CLD is stable when using only early checkpoints, incurring negligible accuracy loss. Finally, CLD exhibits inherent bias reduction via per-class validation alignment, obviating the need for additional stratified sampling. Together, these properties make CLD a principled, efficient, stable, and transferable tool for scalable dataset optimization.
Problem

Research questions and friction points this paper is trying to address.

Proposing CLD metric for efficient coreset selection
Reducing computational costs without gradient calculations
Ensuring convergence and performance across architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLD metric for coreset selection via loss trajectories
Efficient per-sample loss computation without gradients
Theoretical convergence guarantees with alignment-based error bounds