🤖 AI Summary
Existing data mixing strategies for large language model pretraining lack theoretical grounding and predictive capability, hindering principled optimization of domain distribution and total volume in training corpora.
Method: We propose BiMix, the first dual-variable data mixing law that jointly models the coupled scaling relationship between domain proportions and absolute data quantities. We further design a lightweight entropy-based proxy metric for efficient mixed-data evaluation, enabling high-accuracy loss extrapolation to unseen mixing configurations (a minimal fitting sketch follows this summary).
Contribution/Results: Our framework achieves strong predictive performance (R² > 0.97; mean relative error < 0.2%) and significantly outperforms baseline methods across diverse, large-scale multi-domain experiments. It constitutes the first data mixing optimization framework that simultaneously ensures interpretability, computational tractability, and generalization across domains and scales.
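To make the dual-variable idea concrete, the sketch below fits a bivariate power-law loss surface over domain proportion and data volume with `scipy.optimize.curve_fit`. The functional form `A / (r**alpha * s**beta) + E`, the synthetic observations, and the parameter names are illustrative assumptions for demonstration, not the exact law stated in the paper.

```python
# Illustrative sketch only: fit an *assumed* bivariate power-law surface
# L(r, s) = A / (r**alpha * s**beta) + E to per-domain validation losses.
# Functional form, synthetic data, and parameter names are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(x, A, alpha, beta, E):
    """Assumed loss surface over domain proportion r and data volume s."""
    r, s = x
    return A / (r**alpha * s**beta) + E

# Synthetic observations standing in for real training runs:
# (proportion, volume) -> loss, generated from the assumed form plus noise.
rng = np.random.default_rng(0)
r_obs = rng.uniform(0.05, 0.6, size=40)      # domain proportions
s_obs = rng.uniform(1e3, 5e4, size=40)       # training data volume (tokens/steps)
loss_obs = mixing_law((r_obs, s_obs), 3.0, 0.3, 0.2, 1.5)
loss_obs += rng.normal(0, 0.01, size=40)

# Fit the four scalar coefficients; p0 is a rough starting point.
params, _ = curve_fit(mixing_law, (r_obs, s_obs), loss_obs,
                      p0=[1.0, 0.5, 0.5, 1.0], maxfev=10000)
A, alpha, beta, E = params
print(f"A={A:.3f} alpha={alpha:.3f} beta={beta:.3f} E={E:.3f}")

# Extrapolate to an unseen mixing configuration.
print("predicted loss:", mixing_law((0.25, 1e5), *params))
```

Once fitted on a handful of observed runs, the same closed form can be queried at unseen proportions or larger volumes, which is the kind of loss extrapolation the summary refers to.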
📝 Abstract
Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R${}^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
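The entropy-based proxy mentioned in the abstract can be illustrated with a small, training-free sketch: score each domain by the Shannon entropy of its token distribution and normalize the scores into mixing weights. The specific entropy measure, the normalization, and the `token_entropy` / `entropy_proxy_weights` helpers below are hypothetical stand-ins for the paper's proxy, shown only to convey how a computationally lightweight signal could guide mixing.

```python
# Minimal sketch of an entropy-style proxy for weighting domains, assuming
# token-level Shannon entropy as the signal; the paper's actual entropy
# measure and normalization may differ.
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_proxy_weights(domain_tokens):
    """Turn per-domain entropies into normalized mixing proportions."""
    entropies = {d: token_entropy(t) for d, t in domain_tokens.items()}
    z = sum(entropies.values())
    return {d: h / z for d, h in entropies.items()}

# Toy corpora standing in for tokenized domain samples.
domains = {
    "web":  "the cat sat on the mat the dog ran".split(),
    "code": "def f ( x ) : return x + x".split(),
    "wiki": "alpha beta gamma alpha delta epsilon".split(),
}
print(entropy_proxy_weights(domains))
```

The appeal of such a proxy is that it requires only corpus statistics, no proxy-model training, which is what makes it a cheap first-pass signal for choosing candidate mixtures before any fitting or extrapolation.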