🤖 AI Summary
This work addresses the performance degradation in multimodal large language models (MLLMs) caused by instruction tuning, which often impairs their foundational text reasoning capabilities. The study uncovers a previously unobserved three-phase behavioral pattern in MLLMs—early-stage modality separation, mid-stage alignment, and late-stage degradation—and proposes a training-free, plateau-guided model merging method. Guided by a layer-wise visual token masking analysis, the approach selectively injects parameters from the base language model to enhance visual grounding without compromising linguistic competence. Evaluated across five prominent MLLMs and nine benchmarks, the method consistently yields significant performance gains. Attention analysis further reveals that the merged models exhibit sharper focus on task-relevant visual regions, demonstrating improved multimodal alignment and reasoning fidelity.
📝 Abstract
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-stage modality separation, mid-stage modality alignment, and late-stage modality degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is available at https://github.com/wzj1718/PlaM.
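The core merging idea described above — selectively injecting base language model parameters into layers chosen by the plateau analysis — can be sketched as a simple layer-wise interpolation. This is a minimal illustration, not the paper's exact implementation: the layer selection, the interpolation coefficient `alpha`, and the toy weight dictionaries below are all hypothetical.

```python
def merge_layers(mllm_weights, base_lm_weights, plateau_layers, alpha=0.5):
    """Interpolate MLLM weights toward base-LM weights on plateau layers.

    For each parameter tensor (here a flat list of floats) whose name is in
    `plateau_layers`, compute (1 - alpha) * mllm + alpha * base_lm;
    all other MLLM parameters are kept unchanged.
    """
    merged = {}
    for name, weights in mllm_weights.items():
        if name in plateau_layers and name in base_lm_weights:
            merged[name] = [(1 - alpha) * w_m + alpha * w_b
                            for w_m, w_b in zip(weights, base_lm_weights[name])]
        else:
            merged[name] = list(weights)  # non-plateau layers: MLLM weights as-is
    return merged

# Toy example with two "layers" as flat weight lists (hypothetical values):
mllm = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
base = {"layer0": [0.0, 0.0], "layer1": [1.0, 2.0]}
out = merge_layers(mllm, base, plateau_layers={"layer1"}, alpha=0.5)
# "layer0" is untouched; "layer1" becomes [2.0, 3.0]
```

Because the method is training-free, the merge is a one-shot operation over the checkpoints; only the choice of which layers fall in the "plateau" (identified via the vision token masking analysis) and the mixing coefficient need to be set.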