Data Mixture Optimization: A Multi-fidelity Multi-scale Bayesian Framework

📅 2025-03-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of theoretical foundations for data mixture ratio optimization in large language model (LLM) pretraining—currently reliant on intuition and trial-and-error. We propose the first multi-fidelity, multi-scale Bayesian optimization (BO) framework that requires no strong assumptions. Methodologically, we formulate data selection as a sequential decision problem jointly optimizing model size, training steps, and data composition; introduce a probabilistic extrapolation mechanism using Gaussian processes to explicitly model performance uncertainty; and integrate multi-fidelity BO with multi-scale adaptive sampling to leverage inexpensive noisy experiments for guiding costly full-scale training. Evaluated on a pretraining simulator built from SlimPajama (472 runs), our framework achieves 2.6×–3.3× training speedup across 20M–1B parameter models, significantly outperforming standard multi-fidelity BO and random search baselines, while demonstrating strong generalization across model scales and downstream tasks.
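The probabilistic extrapolation idea above can be illustrated with a minimal Gaussian-process sketch: fit a GP to losses observed at small scale, then query it at a larger (unobserved) scale, where the posterior standard deviation widens to reflect extrapolation uncertainty. This is a rough illustration under assumed details, not the paper's implementation; the RBF kernel, its hyperparameters, and the function names (`rbf_kernel`, `gp_posterior`) are all choices made here for the example.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.5, variance=1.0):
    """Squared-exponential kernel over feature rows, e.g. (mixture weight, normalized scale)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X_train, y_train, X_query, noise=1e-4):
    """Posterior mean and std of a zero-mean GP at the query points."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_query)
    K_ss = rbf_kernel(X_query, X_query)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))  # K^{-1} y
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v**2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))
```

Querying at a point far from the training data along the scale axis yields a posterior std close to the prior variance, which is exactly the signal a multi-fidelity acquisition rule can use to decide whether a costly large-scale run is worth its information gain.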

📝 Abstract
Careful curation of data sources can significantly improve the performance of LLM pre-training, but predominant approaches rely heavily on intuition or costly trial-and-error, making them difficult to generalize across different data domains and downstream tasks. Although scaling laws can provide a principled and general approach for data curation, standard deterministic extrapolation from small-scale experiments to larger scales requires strong assumptions on the reliability of such extrapolation, whose brittleness has been highlighted in prior works. In this paper, we introduce a probabilistic extrapolation framework for data mixture optimization that avoids rigid assumptions and explicitly models the uncertainty in performance across decision variables. We formulate data curation as a sequential decision-making problem – multi-fidelity, multi-scale Bayesian optimization – where {data mixtures, model scale, training steps} are adaptively selected to balance training cost and potential information gain. Our framework naturally gives rise to algorithm prototypes that leverage noisy information from inexpensive experiments to systematically inform costly training decisions. To accelerate methodological progress, we build a simulator based on 472 language model pre-training runs with varying data compositions from the SlimPajama dataset. We observe that even simple kernels and acquisition functions can enable principled decisions across training models from 20M to 1B parameters and achieve 2.6x and 3.3x speedups compared to multi-fidelity BO and random search baselines. Taken together, our framework underscores potential efficiency gains achievable by developing principled and transferable data mixture optimization methods.
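The abstract's core loop, using noisy cheap experiments to gate costly full-scale runs, can be sketched in a deliberately simplified screen-and-promote form. This is a bandit-style caricature of the idea, not the paper's Bayesian optimization algorithm; `multi_fidelity_search`, the cost constants, and the two-fidelity split are assumptions made for illustration.

```python
def multi_fidelity_search(evaluate, mixtures, budget,
                          cheap_cost=1.0, full_cost=10.0, promote_k=2):
    """Screen every mixture with cheap noisy runs, then re-evaluate the best at full scale.

    `evaluate(mixture, fidelity)` returns a loss (lower is better);
    the caller decides what "cheap" and "full" fidelities mean.
    """
    spent = 0.0
    cheap_scores = {}
    # Phase 1: cheap, noisy screening of every candidate mixture.
    for m in mixtures:
        if spent + cheap_cost > budget:
            break
        cheap_scores[m] = evaluate(m, fidelity="cheap")
        spent += cheap_cost
    # Phase 2: promote the most promising mixtures to costly full-fidelity runs.
    ranked = sorted(cheap_scores, key=cheap_scores.get)
    full_scores = {}
    for m in ranked[:promote_k]:
        if spent + full_cost > budget:
            break
        full_scores[m] = evaluate(m, fidelity="full")
        spent += full_cost
    best = min(full_scores, key=full_scores.get) if full_scores else ranked[0]
    return best, spent
```

The paper replaces this fixed screen-then-promote schedule with a GP surrogate and acquisition function that decide adaptively, per step, which (mixture, model scale, training steps) triple buys the most information per unit cost.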
Problem

Research questions and friction points this paper is trying to address.

Optimizing data mixtures for LLM pre-training efficiently
Reducing reliance on intuition and trial-and-error methods
Modeling uncertainty in performance across decision variables
Innovation

Methods, ideas, or system contributions that make the work stand out.

Probabilistic extrapolation framework for data mixture optimization
Multi-fidelity, multi-scale Bayesian optimization approach
Simulator built from 472 SlimPajama pre-training runs for cheap method evaluation