🤖 AI Summary
Existing data mixing strategies for large language model pretraining lack theoretical grounding and predictive capability, hindering principled optimization of domain distribution and total volume in training corpora.
Method: We propose BiMix, the first dual-variable data mixing law that jointly models the coupled scaling relationship between domain proportions and absolute data quantities. We further design a lightweight entropy-based proxy metric for efficient mixed-data evaluation, enabling high-accuracy loss extrapolation to unseen mixing configurations (a minimal fitting sketch follows this summary).
Contribution/Results: Our framework achieves strong predictive performance (R² > 0.97; mean relative error < 0.2%) and significantly outperforms baseline methods across diverse, large-scale multi-domain experiments. It constitutes the first data mixing optimization framework that simultaneously ensures interpretability, computational tractability, and generalization across domains and scales.
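To make the dual-variable idea concrete, the sketch below fits a bivariate power-law loss surface over domain proportion and data volume with `scipy.optimize.curve_fit`. The functional form `A / (r**alpha * s**beta) + E`, the synthetic observations, and the parameter names are illustrative assumptions for demonstration, not the exact law stated in the paper.

```python
# Illustrative sketch only: fit an *assumed* bivariate power-law surface
# L(r, s) = A / (r**alpha * s**beta) + E to per-domain validation losses.
# Functional form, synthetic data, and parameter names are assumptions.
import numpy as np
from scipy.optimize import curve_fit

def mixing_law(x, A, alpha, beta, E):
    """Assumed loss surface over domain proportion r and data volume s."""
    r, s = x
    return A / (r**alpha * s**beta) + E

# Synthetic observations standing in for real training runs:
# (proportion, volume) -> loss, generated from the assumed form plus noise.
rng = np.random.default_rng(0)
r_obs = rng.uniform(0.05, 0.6, size=40)      # domain proportions
s_obs = rng.uniform(1e3, 5e4, size=40)       # training data volume (tokens/steps)
loss_obs = mixing_law((r_obs, s_obs), 3.0, 0.3, 0.2, 1.5)
loss_obs += rng.normal(0, 0.01, size=40)

# Fit the four scalar coefficients; p0 is a rough starting point.
params, _ = curve_fit(mixing_law, (r_obs, s_obs), loss_obs,
                      p0=[1.0, 0.5, 0.5, 1.0], maxfev=10000)
A, alpha, beta, E = params
print(f"A={A:.3f} alpha={alpha:.3f} beta={beta:.3f} E={E:.3f}")

# Extrapolate to an unseen mixing configuration.
print("predicted loss:", mixing_law((0.25, 1e5), *params))
```

Once fitted on a handful of observed runs, the same closed form can be queried at unseen proportions or larger volumes, which is the kind of loss extrapolation the summary refers to.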
📝 Abstract
Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$'s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R${}^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
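The entropy-based proxy mentioned in the abstract can be illustrated with a small, training-free sketch: score each domain by the Shannon entropy of its token distribution and normalize the scores into mixing weights. The specific entropy measure, the normalization, and the `token_entropy` / `entropy_proxy_weights` helpers below are hypothetical stand-ins for the paper's proxy, shown only to convey how a computationally lightweight signal could guide mixing.

```python
# Minimal sketch of an entropy-style proxy for weighting domains, assuming
# token-level Shannon entropy as the signal; the paper's actual entropy
# measure and normalization may differ.
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def entropy_proxy_weights(domain_tokens):
    """Turn per-domain entropies into normalized mixing proportions."""
    entropies = {d: token_entropy(t) for d, t in domain_tokens.items()}
    z = sum(entropies.values())
    return {d: h / z for d, h in entropies.items()}

# Toy corpora standing in for tokenized domain samples.
domains = {
    "web":  "the cat sat on the mat the dog ran".split(),
    "code": "def f ( x ) : return x + x".split(),
    "wiki": "alpha beta gamma alpha delta epsilon".split(),
}
print(entropy_proxy_weights(domains))
```

The appeal of such a proxy is that it requires only corpus statistics, no proxy-model training, which is what makes it a cheap first-pass signal for choosing candidate mixtures before any fitting or extrapolation.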