🤖 AI Summary
This work proposes an efficient method for optimizing data mixture ratios in large language model training, circumventing the need for costly trial-and-error or proxy training. The approach trains specialized expert models on small amounts of domain-specific data and repurposes their learnable model merging weights as a high-fidelity, low-cost performance proxy. By optimizing these merging weights on downstream tasks, the method automatically derives near-optimal data mixture proportions. Notably, this is the first study to employ merging weights for data mixture optimization; the learned weights show strong rank consistency with full training (Spearman correlation coefficients exceeding 0.9) and transfer across model scales. Evaluated on 8B and 16B parameter models, the method matches or surpasses manually tuned baselines while substantially reducing the computational cost of the mixture search.
📝 Abstract
Optimizing data mixtures is essential for unlocking the full potential of large language models (LLMs), yet identifying the optimal composition remains computationally prohibitive due to reliance on heuristic trials or expensive proxy training. To address this, we introduce \textbf{MergeMix}, a novel approach that efficiently determines optimal data mixing ratios by repurposing model merging weights as a high-fidelity, low-cost performance proxy. By training domain-specific experts on minimal tokens and optimizing their merging weights against downstream benchmarks, MergeMix identifies high-performing data mixtures without incurring the cost of full-scale training. Extensive experiments on models with 8B and 16B parameters validate that MergeMix achieves performance comparable to or surpassing exhaustive manual tuning while drastically reducing search costs. Furthermore, MergeMix exhibits high rank consistency (Spearman $\rho>0.9$) and strong cross-scale transferability, offering a scalable, automated solution for data mixture optimization.
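The pipeline the abstract describes (merge domain experts under learnable simplex-constrained weights, optimize those weights against a downstream objective, then read the optimized weights off as mixture proportions) can be sketched on toy data. Everything below is an illustrative assumption rather than the paper's implementation: the experts are random parameter vectors instead of trained checkpoints, the "downstream benchmark" is a quadratic surrogate, and the softmax parameterization of the merging weights is one common way to keep them on the simplex.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy stand-ins for domain-expert checkpoints (hypothetical: in the paper
# these would be full models, each trained on a small domain-specific corpus).
rng = np.random.default_rng(0)
experts = rng.normal(size=(3, 8))    # 3 domain experts, 8 "parameters" each

# Hypothetical downstream objective: squared distance of the merged model's
# parameters to a target best matched by a 0.6 / 0.1 / 0.3 expert blend.
target = 0.6 * experts[0] + 0.1 * experts[1] + 0.3 * experts[2]

def loss_and_grad(z):
    w = softmax(z)                   # merging weights, constrained to the simplex
    merged = w @ experts             # weighted average of expert parameters
    diff = merged - target
    s = experts @ (2.0 * diff)       # dL/dw, one entry per expert
    grad_z = w * (s - w @ s)         # chain rule through the softmax
    return float(diff @ diff), grad_z

z = np.zeros(3)                      # start from a uniform merge
for _ in range(5000):
    loss, g = loss_and_grad(z)
    z -= 0.2 * g                     # plain gradient descent on the logits

ratios = softmax(z)                  # read off as data-mixture proportions
print(ratios.round(3), f"{loss:.2e}")
```

In this toy setting the recovered `ratios` approach the 0.6/0.1/0.3 blend used to construct the target; per the abstract, MergeMix instead optimizes the merging weights against real downstream benchmarks and reuses them as the data mixture for full-scale training.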