AI Summary
Current LLM pretraining treats all training instances with static, uniform weights, ignoring instance-level importance and how that importance evolves over the course of training.
Method: We propose a dynamic loss-based instance reweighting framework that enables fine-grained, online, and training-phase-adaptive weight updates. Our approach estimates per-instance losses in real time and optimizes the gradient update trajectory accordingly. Crucially, we establish the first theoretical convergence framework for loss-driven reweighting, ensuring both theoretical soundness and engineering scalability.
Contribution/Results: Applied to 7B and 1.4B model pretraining, our method accelerates convergence by 23% on average, improves downstream task performance by 1.8 to 3.4 percentage points, and generalizes across tasks at multiple scales, including linear regression. The core innovation lies in modeling instance importance as a time-varying function of the loss while guaranteeing provable convergence and practical scalability of the reweighting process.
Abstract
Pretraining large language models (LLMs) on vast and heterogeneous datasets is crucial for achieving state-of-the-art performance across diverse downstream tasks. However, current training paradigms treat all samples equally, overlooking the importance or relevance of individual samples throughout the training process. Existing reweighting strategies, which primarily focus on group-level data importance, fail to leverage fine-grained instance-level information and do not adapt dynamically to individual sample importance as training progresses. In this paper, we introduce novel algorithms for dynamic, instance-level data reweighting aimed at improving both the efficiency and effectiveness of LLM pretraining. Our methods adjust the weight of each training sample based on its loss value in an online fashion, allowing the model to dynamically focus on samples that are more informative or important at the current training stage. In particular, our framework allows us to systematically devise reweighting strategies that deprioritize redundant or uninformative data, which we find work best. Furthermore, we develop a new theoretical framework for analyzing the impact of loss-based reweighting on the convergence of gradient-based optimization, providing the first formal characterization of how these strategies affect convergence bounds. We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance.
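To make the idea concrete, here is a minimal sketch of online, loss-based instance reweighting within a single training step. The paper does not specify its exact weighting rule, so this example assumes one plausible choice: a softmax over per-instance losses with a hypothetical `temperature` knob, which upweights high-loss (more informative) samples and deprioritizes low-loss (often redundant) ones. It is an illustrative sketch, not the authors' implementation.

```python
import math

def reweight_losses(losses, temperature=1.0):
    """Map per-instance losses to sample weights via a softmax.

    Higher-loss samples receive larger weights; low-loss (redundant)
    samples are deprioritized. `temperature` is an assumed knob:
    large values flatten the weights toward uniform (standard training).
    Weights are scaled to sum to the batch size, so each weight is
    ~1 on average, matching the scale of uniform weighting.
    """
    m = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in losses]
    total = sum(exps)
    n = len(losses)
    return [n * e / total for e in exps]

def weighted_batch_loss(losses, weights):
    """Reweighted mean loss; its gradient emphasizes upweighted samples."""
    n = len(losses)
    return sum(w * l for w, l in zip(weights, losses)) / n

# Hypothetical per-instance losses for one minibatch of four samples.
losses = [0.1, 0.5, 2.0, 0.05]
weights = reweight_losses(losses, temperature=1.0)
# The hardest sample (loss 2.0) dominates; the near-zero-loss sample
# is effectively deprioritized at this stage of training.
```

In a real training loop the per-instance losses would come from a forward pass with per-sample (unreduced) loss computation, and the weights would be recomputed every step, which is what makes the scheme online and training-phase-adaptive.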