🤖 AI Summary
To address the inefficiency and high computational cost of static, training-agnostic data selection in large language model (LLM) training, this paper proposes the Data Weighting Model (DWM), which dynamically adjusts the weights of selected data within each batch during training. DWM employs a bi-level optimization framework to update the weighting model so that it tracks the trained model's evolving data preferences. Experiments show that DWM improves the performance of models trained on randomly selected data, and that the learned weighting model transfers to other data selection methods and to models of different sizes. The paper further analyzes how a model's data preferences evolve across training stages, offering new insight into dynamic data utilization in LLM training.
📝 Abstract
While large-scale training data is fundamental for developing capable large language models (LLMs), strategically selecting high-quality data has emerged as a critical approach to enhance training efficiency and reduce computational costs. Current data selection methodologies predominantly rely on static, training-agnostic criteria, failing to account for the dynamic interactions between model training and data. In this paper, we propose a new Data Weighting Model (DWM) that adjusts the weight of selected data within each batch to achieve dynamic data utilization during LLM training. Specifically, to better capture the dynamic data preferences of the trained model, a bi-level optimization framework is implemented to update the weighting model. Our experiments demonstrate that DWM enhances the performance of models trained with randomly selected data, and that the learned weighting model can be transferred to enhance other data selection methods and models of different sizes. Moreover, we analyze how a model's data preferences evolve throughout training, providing new insights into its changing data preferences.
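To make the bi-level idea concrete, here is a minimal toy sketch (not the paper's implementation; all names, model sizes, and the finite-difference hypergradient are illustrative assumptions). An inner step updates the model with per-batch weights produced by a small weighting model; an outer step updates the weighting parameter to reduce loss on a held-out set, so the weighter learns to prefer clean examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: half clean, half label-noised examples of y = 2x.
# `quality` is an illustrative per-example feature seen by the weighter.
n_clean, n_noisy = 40, 40
x = rng.normal(size=n_clean + n_noisy)
y = 2.0 * x
y[n_clean:] += rng.normal(scale=3.0, size=n_noisy)
quality = np.r_[np.ones(n_clean), np.zeros(n_noisy)]

x_val = rng.normal(size=20)           # held-out set for the outer objective
y_val = 2.0 * x_val

def batch_weights(phi, q):
    """Weighting model: softmax over a scalar quality feature."""
    logits = phi * q
    e = np.exp(logits - logits.max())
    return e / e.sum()

def inner_step(w, phi, xb, yb, qb, lr=0.1):
    """Inner problem: one weighted-gradient step on the training loss."""
    wts = batch_weights(phi, qb)
    grad = np.sum(wts * 2.0 * (w * xb - yb) * xb)
    return w - lr * grad

def val_loss(w):
    return np.mean((w * x_val - y_val) ** 2)

w, phi = 0.0, 0.0
for step in range(200):
    idx = rng.choice(len(x), size=16, replace=False)
    xb, yb, qb = x[idx], y[idx], quality[idx]
    # Outer problem: move phi to reduce validation loss *after* the inner
    # update (a finite-difference hypergradient, for simplicity).
    eps = 1e-3
    g = (val_loss(inner_step(w, phi + eps, xb, yb, qb))
         - val_loss(inner_step(w, phi - eps, xb, yb, qb))) / (2 * eps)
    phi -= 1.0 * g
    w = inner_step(w, phi, xb, yb, qb)

print(round(w, 2), phi > 0)  # phi > 0 means the weighter upweights clean data
```

Alternating the two updates per batch is what makes the weighting dynamic: the weighter's parameters adapt as the trained model's preferences shift, rather than being fixed before training as in static selection.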