🤖 AI Summary
A standardized, integrated benchmark for training and evaluating multimodal large language models (MLLMs) is lacking, hindering systematic research on cross-modal model merging.
Method: We introduce the first dedicated MLLM model merging benchmark, covering diverse tasks—including visual question answering, geometric reasoning, chart understanding, OCR, and localization—and supporting both LoRA and full-parameter fine-tuned models. We systematically investigate vision–language, audio–language, and video–language fusion pathways. Furthermore, we propose a novel fusion method integrating task vector denoising, interactive loss optimization, LoRA adapter fusion, and multimodal alignment.
Results: Our approach achieves an average performance gain of 2.48% across benchmarks, surpassing unimodal expert models without additional training data. This constitutes the first empirical demonstration that multimodal complementarity can yield stronger, general-purpose Omni-language models.
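The task-vector pipeline described above (subtract the base weights from each expert, suppress small "noise" entries, then combine) can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: `keep_ratio`, `alpha`, and the top-k trimming rule are assumed hyperparameters, and the loss-based optimization over task-vector interactions is omitted.

```python
import numpy as np

def merge_task_vectors(base, experts, keep_ratio=0.2, alpha=1.0):
    """Merge expert models into a base model via trimmed task-vector averaging.

    base:       dict of parameter name -> np.ndarray (shared pretrained weights)
    experts:    list of dicts with the same keys (fine-tuned expert weights)
    keep_ratio: fraction of largest-magnitude entries kept per task vector
                (a simple stand-in for the paper's denoising step)
    alpha:      scaling coefficient applied to the merged task vector
    """
    merged = {}
    for name, w0 in base.items():
        # Task vector = fine-tuned weights minus the shared base weights.
        tvs = [e[name] - w0 for e in experts]
        trimmed = []
        for tv in tvs:
            flat = np.abs(tv).ravel()
            k = max(1, int(keep_ratio * flat.size))
            # Keep only the top-k largest-magnitude entries; zero the rest.
            thresh = np.partition(flat, -k)[-k]
            trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
        merged[name] = w0 + alpha * np.mean(trimmed, axis=0)
    return merged
```

Because merging operates purely on existing checkpoints, no gradient updates or training data are needed, which matches the training-free claim above.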
📝 Abstract
While foundation models update slowly due to resource-intensive training requirements, domain-specific models evolve between updates. Model merging aims to combine multiple expert models into a single, more capable model, thereby reducing storage and serving costs while supporting decentralized model development. Despite its potential, previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. Multimodal Large Language Models (MLLMs), which extend the capabilities of LLMs through large-scale multimodal training, have gained traction; however, no benchmark for model merging research exists that clearly divides the tasks used for MLLM training and evaluation. In this paper, (i) we introduce a model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models. Moreover, we explore how model merging can combine different modalities (e.g., vision-language, audio-language, and video-language models), moving toward an Omni-language model. (ii) We implement 10 model merging algorithms on the benchmark. Furthermore, we propose a novel method that removes noise from task vectors and robustly optimizes the merged vector using a loss defined over task-vector interactions, achieving an average performance gain of 2.48%. (iii) We find that model merging offers a promising way to build improved MLLMs without requiring additional training data. Our results also demonstrate that the complementarity among multiple modalities outperforms any individual modality.
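Since the benchmark provides both LoRA and full fine-tuning checkpoints, merging LoRA experts deserves its own note: each adapter contributes a low-rank update ΔW = B·A, and a simple way to fuse several adapters is a weighted combination of those updates. The sketch below assumes this uniform/weighted-sum formulation; the paper's actual LoRA fusion procedure may differ.

```python
import numpy as np

def fuse_lora_adapters(base_weight, adapters, weights=None):
    """Fuse several LoRA adapters into one dense weight matrix.

    base_weight: (d_out, d_in) frozen base weight
    adapters:    list of (A, B) pairs, with A: (r, d_in) and B: (d_out, r),
                 so each adapter's update is B @ A
    weights:     optional per-adapter mixing coefficients (default: uniform)
    """
    if weights is None:
        weights = [1.0 / len(adapters)] * len(adapters)
    # Sum the weighted low-rank updates, then fold them into the base weight.
    delta = sum(w * (B @ A) for w, (A, B) in zip(weights, adapters))
    return base_weight + delta
```

With uniform weights this reduces to averaging the adapters' dense updates, the LoRA analogue of the task-vector averaging used for full fine-tuned models.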