🤖 AI Summary
This work addresses the lack of a unified, efficient data mixing strategy across pretraining, continual midtraining, and continual instruction tuning; existing methods typically cover a single phase and often rely on costly proxy models. The authors propose OP-Mix (On-Policy Mix), a data mixing algorithm for the full training lifecycle that removes the need for proxy models by making mixing decisions directly on the main model. OP-Mix trains low-rank adapters (LoRA) on the current model and interpolates between them to simulate candidate mixtures cheaply, treating data mixing as an online decision-making problem. In pretraining, OP-Mix improves average perplexity by 6.3% over training without mixing; in continual learning, it matches the performance of retraining and on-policy distillation while using 66% and 95% less overall compute, respectively.
📝 Abstract
Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision-making problem, one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.
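The central idea, simulating a candidate mixture by interpolating between per-domain low-rank adapters instead of training a proxy model, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the interpolation rule or the search procedure. The sketch assumes that a mixture is simulated by adding the mixture-weighted sum of adapter updates to the frozen base weights, and that candidates are scored by held-out loss over a coarse grid. The linear "model", the random adapters, and every name below (`interpolate`, `validation_loss`, `simplex_grid`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, n_domains = 8, 4, 2, 3

# Frozen base weights of a toy linear model (stand-in for the current LM).
W_base = rng.normal(size=(d_out, d_in))

# One low-rank update per data domain: delta_i = B_i @ A_i.
# Assumption: real adapters would each be trained on one domain; these are random.
adapters = [
    (0.1 * rng.normal(size=(d_out, rank)), 0.1 * rng.normal(size=(rank, d_in)))
    for _ in range(n_domains)
]


def interpolate(weights):
    """Base weights plus the mixture-weighted sum of adapter updates."""
    weights = np.asarray(weights, dtype=float)
    assert weights.shape == (n_domains,)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    delta = sum(w * (B @ A) for w, (B, A) in zip(weights, adapters))
    return W_base + delta


def validation_loss(W, X, Y):
    """Mean squared error of the toy linear model on held-out data."""
    return float(np.mean((X @ W.T - Y) ** 2))


def simplex_grid(n, steps):
    """Yield every mixture over n domains whose entries are multiples of 1/steps."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, slots - 1):
                yield (k,) + rest

    for counts in rec(steps, n):
        yield np.array(counts) / steps


# Held-out targets come from a hidden mixture, so the search has a known answer.
target_mix = np.array([0.6, 0.3, 0.1])
X_val = rng.normal(size=(256, d_in))
Y_val = X_val @ interpolate(target_mix).T

# Score each candidate mixture without any extra training: only interpolation
# and a forward pass on held-out data.
best_mix = min(
    simplex_grid(n_domains, steps=10),
    key=lambda w: validation_loss(interpolate(w), X_val, Y_val),
)
print("selected mixture:", best_mix)
```

In this toy setting, scoring a candidate costs one weighted sum of adapter updates and one evaluation pass, which is the sense in which simulation is cheap compared with training a separate proxy model per candidate. A grid search only scales to a handful of domains; how OP-Mix actually searches the mixture space is not described in the abstract.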