🤖 AI Summary
This work addresses the performance degradation in multimodal large language models (MLLMs) caused by instruction tuning, which often impairs their foundational text reasoning capabilities. The study uncovers a previously unobserved three-phase behavioral pattern in MLLMs—early-stage modality separation, mid-stage alignment, and late-stage degradation—and proposes a training-free, plateau-guided model merging method. Guided by a layer-wise visual token masking analysis, the approach selectively injects parameters from the base language model to enhance visual grounding without compromising linguistic competence. Evaluated across five prominent MLLMs and nine benchmarks, the method consistently yields significant performance gains. Attention analysis further reveals that the merged models exhibit sharper focus on task-relevant visual regions, demonstrating improved multimodal alignment and reasoning fidelity.
📝 Abstract
Multimodal Large Language Models (MLLMs) rely on strong linguistic reasoning inherited from their base language models. However, multimodal instruction fine-tuning paradoxically degrades this text reasoning capability, undermining multimodal performance. To address this issue, we propose a training-free framework that mitigates the degradation. Through layer-wise vision token masking, we reveal a common three-stage pattern in MLLMs: early-stage modality separation, mid-stage modality alignment, and late-stage modality degradation. By analyzing the behavior of MLLMs at different stages, we propose a plateau-guided model merging method that selectively injects base language model parameters into MLLMs. Experimental results with five MLLMs on nine benchmarks demonstrate the effectiveness of our method. Attention-based analysis further reveals that merging shifts attention from diffuse, scattered patterns to focused localization on task-relevant visual regions. Our repository is available at https://github.com/wzj1718/PlaM.
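The core merging idea described above — selectively injecting base language model parameters into layers chosen by the plateau analysis — can be sketched as a simple layer-wise interpolation. This is a minimal illustration, not the paper's exact implementation: the layer selection, the interpolation coefficient `alpha`, and the toy weight dictionaries below are all hypothetical.

```python
def merge_layers(mllm_weights, base_lm_weights, plateau_layers, alpha=0.5):
    """Interpolate MLLM weights toward base-LM weights on plateau layers.

    For each parameter tensor (here a flat list of floats) whose name is in
    `plateau_layers`, compute (1 - alpha) * mllm + alpha * base_lm;
    all other MLLM parameters are kept unchanged.
    """
    merged = {}
    for name, weights in mllm_weights.items():
        if name in plateau_layers and name in base_lm_weights:
            merged[name] = [(1 - alpha) * w_m + alpha * w_b
                            for w_m, w_b in zip(weights, base_lm_weights[name])]
        else:
            merged[name] = list(weights)  # non-plateau layers: MLLM weights as-is
    return merged

# Toy example with two "layers" as flat weight lists (hypothetical values):
mllm = {"layer0": [1.0, 2.0], "layer1": [3.0, 4.0]}
base = {"layer0": [0.0, 0.0], "layer1": [1.0, 2.0]}
out = merge_layers(mllm, base, plateau_layers={"layer1"}, alpha=0.5)
# "layer0" is untouched; "layer1" becomes [2.0, 3.0]
```

Because the method is training-free, the merge is a one-shot operation over the checkpoints; only the choice of which layers fall in the "plateau" (identified via the vision token masking analysis) and the mixing coefficient need to be set.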