AI Summary
This work addresses the scarcity of high-quality training data for large language models and the lack of theoretical guidance for scheduling data quality jointly with training dynamics. The authors extend functional scaling laws with a data-quality dimension and derive an asymptotic closed-form solution that reveals a dual role for high-quality data: a signal amplifier in the noise-limited regime and a noise suppressor in the signal-limited regime. This goes beyond conventional curriculum learning, which typically reserves high-quality data for later training stages and so exploits only the second role. Building on this insight, they introduce Drop-Stable-Rampup, a batch-size schedule for LLM midtraining. Evaluated on a 15B Mixture-of-Experts model midtrained on 108B tokens, the approach improves average accuracy by 1.70 points over Warmup-Stable-Decay (WSD) and 2.98 points over Cosine-decay, with particularly strong results on mathematical reasoning benchmarks such as GSM8K and MATH.
Abstract
High-quality data is scarce in large language model (LLM) training, yet how to schedule its use jointly with training dynamics lacks theoretical guidance. We extend functional scaling laws by incorporating a data-quality dimension, and solve the joint data-quality and batch-size scheduling problem in asymptotic closed form. The solution reveals two regimes and a dual role of high-quality data. In the noise-limited regime, high-quality data should be used as a signal amplifier: lowering the batch size converts cleaner data into more signal without amplifying noise. In the signal-limited regime, it should be used as a noise suppressor: late placement reduces terminal noise without sacrificing signal accumulation. Existing curriculum-style pipelines primarily exploit the second role by placing cleaner data late, but miss the first role because conventional decay schedules reduce update intensity exactly when high-quality data becomes available. Guided by this, we propose Drop-Stable-Rampup for LLM midtraining: upon the quality transition, drop the batch size, hold it stable to accumulate signal, then ramp up to suppress terminal noise. On a 15B Mixture-of-Experts model midtrained on 108B tokens, Drop-Stable-Rampup improves average accuracy over Warmup-Stable-Decay (WSD) by +1.70 and over Cosine-decay by +2.98, with particularly large gains on mathematical reasoning benchmarks such as GSM8K (+4.23) and MATH (+2.80).
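The three phases of Drop-Stable-Rampup can be illustrated with a minimal sketch. The abstract gives only the qualitative shape (drop, hold, ramp up), so everything concrete here is an assumption: the function name `drop_stable_rampup_batch_size`, its parameters, the linear ramp, the choice to ramp back to the pre-transition batch size, and the numbers in the usage example are hypothetical and are not taken from the paper.

```python
def drop_stable_rampup_batch_size(step, total_steps, stable_steps,
                                  base_batch, low_batch):
    """Return the batch size for a midtraining step (0-indexed).

    Step 0 is the quality transition, i.e. the first step on high-quality data.
    Assumptions (not from the paper): the ramp is linear and ends at
    base_batch, the batch size used before the transition.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    if not 0 < stable_steps < total_steps:
        raise ValueError("stable_steps must lie in (0, total_steps)")
    if low_batch > base_batch:
        raise ValueError("low_batch must not exceed base_batch")

    # Drop + Stable: hold the reduced batch size to accumulate signal.
    if step < stable_steps:
        return low_batch

    # Rampup: grow the batch size linearly to suppress terminal noise.
    ramp_steps = total_steps - stable_steps
    progress = (step - stable_steps + 1) / ramp_steps
    return round(low_batch + progress * (base_batch - low_batch))


# Hypothetical toy schedule: 10 steps, 6 of them stable.
schedule = [
    drop_stable_rampup_batch_size(s, total_steps=10, stable_steps=6,
                                  base_batch=1024, low_batch=256)
    for s in range(10)
]
print(schedule)
```

With these toy values the batch size stays at 256 for the first six steps and then rises through 448, 640, and 832 to 1024. A real schedule would be defined over tokens or optimizer steps at the scale of the 108B-token midtraining run, and the ramp shape and endpoints would follow the paper's closed-form solution rather than the linear rule assumed here.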