🤖 AI Summary
Problem: Practitioners training large foundation models on multi-domain data currently rely on empirical trial and error to set the data mixing ratios, which is inefficient and generalizes poorly across domains and scales.
Method: We propose the first data-mixing optimization framework grounded in scaling laws, constructing a loss prediction model parameterized by model size, total training tokens, and domain-specific weights. The model is calibrated via lightweight small-scale experiments and enables accurate extrapolation to unseen data mixtures and larger models.
Contribution: This work pioneers the systematic application of scaling laws to data mixture optimization. It supports cross-modal (language, vision, native multimodal) and cross-scale performance prediction. Evaluated on LLMs, LVMs, and multimodal models, our framework achieves high accuracy (mean prediction error <2%) and strong generalization, significantly reducing the cost of hyperparameter tuning for large-scale training.
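To make the calibrate-then-extrapolate idea concrete, here is a minimal toy sketch. It assumes a simple additive loss law that is linear in the domain weights with fixed exponents; the paper's actual parameterization is not given in this summary, so the functional form, the coefficients, and the synthetic "small-scale runs" below are all illustrative assumptions.

```python
import numpy as np

# Hypothetical additive law (illustrative, NOT the paper's exact form):
#   L(N, D, h) = E + sum_i h[i]*a[i]*N**(-alpha) + sum_i h[i]*b[i]*D**(-beta)
# With alpha, beta fixed for simplicity, calibrating (E, a, b) on a handful
# of small-scale runs reduces to a linear least-squares fit.
ALPHA, BETA = 0.30, 0.28

def features(N, D, h):
    # Feature vector so that loss = features @ (E, a_1, a_2, b_1, b_2)
    return np.concatenate(([1.0], h * N**(-ALPHA), h * D**(-BETA)))

def predict(coef, N, D, h):
    return features(N, D, h) @ coef

# Synthetic "ground truth" law and small-scale calibration runs (2 domains):
# a grid of small model sizes N, token counts D, and mixture weights h.
true_coef = np.array([1.8, 5.0, 3.0, 40.0, 60.0])  # E, a_1, a_2, b_1, b_2
runs = [(N, D, np.array([w, 1.0 - w]))
        for N in (1e7, 3e7, 1e8)
        for D in (1e9, 3e9)
        for w in (0.2, 0.5, 0.8)]
X = np.array([features(N, D, h) for N, D, h in runs])
y = X @ true_coef                    # observed losses (noise-free toy data)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # calibrate on the small runs

# Extrapolate: larger model, more tokens, and an unseen domain mixture.
N_big, D_big, h_new = 1e9, 1e11, np.array([0.35, 0.65])
print(predict(coef, N_big, D_big, h_new))
```

In this noise-free toy the fit recovers the generating coefficients exactly, so the extrapolated prediction matches the true law; in practice the exponents would also be fitted and the small-scale losses would be noisy.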
📝 Abstract
Large foundation models are typically trained on data from multiple domains, with the data mixture (the proportion of each domain used) playing a critical role in model performance. The standard approach to selecting this mixture relies on trial and error, which becomes impractical for large-scale pretraining. We propose a systematic method to determine the optimal data mixture for any target domain using scaling laws. Our approach accurately predicts the loss of a model of size $N$ trained with $D$ tokens and a specific domain weight vector $h$. We validate the universality of these scaling laws by demonstrating their predictive power in three distinct and large-scale settings: large language model (LLM), native multimodal model (NMM), and large vision model (LVM) pretraining. We further show that these scaling laws can extrapolate to new data mixtures and across scales: their parameters can be accurately estimated using a few small-scale training runs, and then used to predict performance at larger scales and unseen domain weights. The scaling laws make it possible to derive the optimal domain weights for any target domain under a given training budget ($N$, $D$), providing a principled alternative to costly trial-and-error methods.
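The last step of the abstract, deriving optimal domain weights for a fixed budget $(N, D)$, amounts to minimizing the predicted loss over the weight simplex. The sketch below illustrates this with a brute-force grid search; the `predicted_loss` function is a made-up nonlinear toy standing in for a calibrated scaling law, not anything specified in the paper.

```python
import numpy as np

def predicted_loss(N, D, h):
    # Placeholder for a fitted law L(N, D, h); all constants here are invented.
    a = np.array([5.0, 3.0, 7.0])
    return (1.8
            + (a @ np.sqrt(h)) * N**(-0.30)
            + 40.0 * D**(-0.28) / (h @ np.array([0.6, 0.3, 0.1]) + 0.1))

def best_mixture(N, D, steps=50):
    # Coarse grid search over the 3-domain simplex. Fine for few domains;
    # with many domains a constrained optimizer would replace this loop.
    best_L, best_h = np.inf, None
    grid = np.linspace(0.0, 1.0, steps + 1)
    for w1 in grid:
        for w2 in grid:
            if w1 + w2 <= 1.0:
                h = np.array([w1, w2, 1.0 - w1 - w2])
                L = predicted_loss(N, D, h)
                if L < best_L:
                    best_L, best_h = L, h
    return best_h, best_L

# Optimal mixture for a hypothetical target budget (N, D).
h_opt, L_opt = best_mixture(N=1e9, D=1e11)
print(h_opt, L_opt)
```

Because the predictor is cheap to evaluate, the search costs seconds, whereas each trial-and-error point it replaces would be a full pretraining run at budget $(N, D)$.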