🤖 AI Summary
Existing sample influence estimation methods are limited to converged models and overlook dynamic changes during optimization. To address this, we propose a layer-aware online data value estimation method. Our approach introduces, for the first time, a layer-aware mechanism that builds a lightweight estimator from gradients of the loss with respect to model outputs, bypassing expensive full-network or parameter-level gradient computations. It enables real-time, fine-grained influence assessment throughout training, without requiring model convergence, thereby significantly improving timeliness and scalability. Extensive experiments on LLM pretraining/finetuning and image classification tasks demonstrate that our method achieves higher accuracy with substantially lower time and memory overhead, validating its effectiveness and practicality for dynamic data curation.
📝 Abstract
Data-centric learning emphasizes curating high-quality training samples to boost performance rather than designing new architectures. A central problem is to estimate the influence of training samples efficiently. Prior studies largely focus on static influence measured on a converged model, overlooking how sample influence dynamically changes during optimization, especially in deep models. To address the computational burden of frequent influence estimation, we develop a layer-aware online estimator that requires only loss-to-output gradients. This design avoids parameter-level and full-network gradients while preserving ranking fidelity. Extensive experiments across LLM pretraining, fine-tuning, and image classification show our method improves accuracy with substantially lower time and memory cost, making dynamic data curation efficient and scalable in practice.
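To make the "loss-to-output gradients" idea concrete, here is a minimal, hypothetical sketch (not the paper's actual estimator, and omitting the layer-aware mechanism): for softmax cross-entropy, the gradient of the loss with respect to the logits has the closed form `softmax(z) - one_hot(y)`, so an output-gradient influence proxy, such as the dot product between a training sample's and a validation sample's output gradients, can be computed without backpropagating through any network parameters.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def output_gradient(logits, label):
    """Gradient of cross-entropy loss w.r.t. the logits.

    For softmax cross-entropy this is softmax(z) - one_hot(label),
    so no parameter-level backpropagation is needed.
    """
    p = softmax(logits)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

def influence_score(train_logits, train_label, val_logits, val_label):
    """Illustrative influence proxy: inner product of output gradients.

    A positive score suggests the training sample's update direction
    (in output space) aligns with reducing the validation loss.
    """
    g_tr = output_gradient(train_logits, train_label)
    g_val = output_gradient(val_logits, val_label)
    return sum(a * b for a, b in zip(g_tr, g_val))

# A sample scored against itself yields its squared gradient norm (>= 0):
score = influence_score([2.0, 1.0, 0.1], 0, [2.0, 1.0, 0.1], 0)
```

The function names and the dot-product scoring rule above are illustrative assumptions; the paper's estimator is layer-aware and operates online during training, which this sketch does not capture.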