Linear Model Merging Unlocks Simple and Scalable Multimodal Data Mixture Optimization

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high cost and combinatorial explosion associated with optimizing data mixing ratios in cross-domain supervised fine-tuning of multimodal large language models. The authors propose an innovative approach that constructs a surrogate model by linearly merging parameters from domain-specific expert models, enabling efficient prediction of performance under various mixing strategies without repeated training. This method pioneers the use of model merging as a proxy mechanism for data mixture optimization, effectively decoupling performance evaluation from actual model training. Evaluated across 14 benchmarks, the surrogate model exhibits strong rank correlation with the true performance of mixed models, substantially improving search efficiency and scalability. The approach offers a low-cost, high-efficiency solution for optimizing multimodal fine-tuning pipelines.

📝 Abstract
Selecting the best data mixture is critical for successful Supervised Fine-Tuning (SFT) of Multimodal Large Language Models. However, determining the optimal mixture weights across multiple domain-specific datasets remains a significant bottleneck due to the combinatorial search space and the high cost associated with even a single training run. This is the so-called Data Mixture Optimization (DMO) problem. On the other hand, model merging unifies domain-specific experts through parameter interpolation. This strategy is efficient, as it only requires a single training run per domain, yet oftentimes leads to suboptimal models. In this work, we take the best of both worlds, studying model merging as an efficient strategy for estimating the performance of different data mixtures. We train domain-specific multimodal experts and evaluate their weighted parameter-space combinations to estimate the efficacy of corresponding data mixtures. We conduct extensive experiments on 14 multimodal benchmarks, and empirically demonstrate that the merged proxy models exhibit a high rank correlation with models trained on actual data mixtures. This decouples the search for optimal mixtures from the resource-intensive training process, thereby providing a scalable and efficient strategy for navigating the complex landscape of mixture weights. Code is publicly available at https://github.com/BerasiDavide/mLLMs_merging_4_DMO.
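The core mechanic described above is a weighted parameter-space (linear) combination of domain-specific experts, evaluated as a cheap proxy for training on the corresponding data mixture. A minimal sketch of that interpolation step, assuming experts are represented as parameter dicts (the function name `merge_experts` and the toy parameters are illustrative, not from the paper's released code):

```python
def merge_experts(expert_states, weights):
    """Linearly interpolate expert parameter dicts with mixture weights.

    expert_states: one dict per domain expert, mapping parameter name
                   to a list of floats (stand-ins for weight tensors).
    weights: mixture weights, assumed non-negative and summing to 1,
             mirroring the data-mixture proportions being evaluated.
    """
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    merged = {}
    for name in expert_states[0]:
        # Each merged parameter is the weight-averaged value across experts.
        merged[name] = [
            sum(w * state[name][i] for w, state in zip(weights, expert_states))
            for i in range(len(expert_states[0][name]))
        ]
    return merged

# Two toy "experts", each with a single two-element parameter.
e1 = {"layer.w": [1.0, 0.0]}
e2 = {"layer.w": [0.0, 2.0]}
proxy = merge_experts([e1, e2], [0.75, 0.25])
print(proxy["layer.w"])  # [0.75, 0.5]
```

In the paper's setup, a merged proxy like this is scored on benchmarks instead of retraining on each candidate mixture, so the mixture-weight search touches only cheap parameter averages.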
Problem

Research questions and friction points this paper is trying to address.

Data Mixture Optimization
Supervised Fine-Tuning
Multimodal Large Language Models
Mixture Weights
Combinatorial Search
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear Model Merging
Data Mixture Optimization
Multimodal LLMs
Parameter Interpolation
Supervised Fine-Tuning