Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently determining optimal data mixture ratios in large language model pretraining that balance general capabilities with performance on demanding tasks such as mathematics and code. The authors propose DeMix, a framework that decouples mixture ratio search from training by first training component models on individual candidate datasets and then constructing a data mixture proxy via weighted model merging. This approach enables efficient evaluation of vast numbers of mixture configurations without additional training, substantially reducing search costs while yielding superior mixing strategies that enhance downstream benchmark performance. The study also releases DeMix Corpora, a high-quality 22T-token pretraining dataset, to support future research.

📝 Abstract
Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy, and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
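The core mechanism described in the abstract is to approximate a model trained on a mixture by a weighted average of component models trained on individual datasets. The sketch below illustrates that idea in miniature, using plain Python dicts in place of real model state dicts; the function name and structure are illustrative assumptions, not taken from the DeMix codebase.

```python
def merge_as_mixture_proxy(component_params, weights):
    """Form a data-mixture proxy by weighted-averaging component model parameters.

    component_params: list of dicts, each mapping parameter name -> value,
                      one dict per component model (one per candidate dataset).
    weights: candidate mixture ratios, assumed non-negative and summing to 1.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture ratios must sum to 1"
    # Average each parameter across component models, weighted by the ratios.
    return {
        name: sum(w * params[name] for w, params in zip(weights, component_params))
        for name in component_params[0]
    }

# Toy example: two "component models", each with a single scalar parameter.
model_a = {"w": 1.0}  # e.g. trained on a general-text dataset
model_b = {"w": 3.0}  # e.g. trained on a math/code dataset
proxy = merge_as_mixture_proxy([model_a, model_b], [0.25, 0.75])
# proxy["w"] == 2.5
```

Because merging is cheap relative to training, many candidate `weights` vectors can be scored this way (by evaluating each merged proxy) without any additional training runs, which is the decoupling the paper emphasizes.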
Problem

Research questions and friction points this paper is trying to address.

data mixture
large language model
pre-training
optimal mixture
scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

model merging
data mixture optimization
large language model pre-training
decoupled search
DeMix
Shengrui Li
NLP Team, Xiaohongshu Inc. Huangpu District, Shanghai, China
Fei Zhao
NLP Team, Xiaohongshu Inc. Huangpu District, Shanghai, China
Kaiyan Zhao
The University of Tokyo
Natural Language Processing
Jieying Ye
NLP Team, Xiaohongshu Inc. Huangpu District, Shanghai, China
Haifeng Liu
Zhejiang University
Machine Learning, Data Management, Information Retrieval
Fangcheng Shi
NLP Team, Xiaohongshu Inc. Huangpu District, Shanghai, China
Zheyong Xie
Xiaohongshu Inc., University of Science and Technology of China
Multimodal, Large Language Model, Agent
Yao Hu
Zhejiang University
Machine Learning
Shaosheng Cao
Xiaohongshu, DiDi Chuxing, Ant Financial, Microsoft Research
LLMs, Multimodal LLMs, Reinforcement Learning, NLP, Graph Neural Networks