Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

📅 2025-05-08
🤖 AI Summary
This study investigates the hierarchical distribution and fusion mechanisms of visual perception and linguistic reasoning capabilities in vision-language models (VLMs). To address the challenge of integrating high-level reasoning into VLMs without fine-tuning, we propose the first cross-modal, parameter-level model merging method, which injects reasoning capabilities from large language models (LLMs) into VLMs in a zero-shot manner. Through layer-wise capability attribution analysis, we find that perceptual capabilities concentrate in shallow layers, while reasoning relies on middle-to-deep layers; after merging, all layers participate in reasoning, yet perceptual representations remain unchanged across layers. Quantitative experiments validate the hierarchical encoding of perception versus reasoning and demonstrate effective, fine-tuning-free reasoning transfer. Our work reveals a structured organization of multimodal capabilities—where perception and reasoning are modularly segregated yet integrable—and establishes a new paradigm for interpretable, modular multimodal model integration.
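Parameter-level merging of this kind can be sketched as a simple interpolation between matching weight tensors of the LLM and the VLM's language backbone. The sketch below is illustrative only (toy NumPy tensors and a hypothetical `merge_parameters` helper, not the paper's actual procedure or parameter names); the paper may use a more elaborate merging scheme.

```python
import numpy as np

def merge_parameters(vlm_params, llm_params, alpha=0.5):
    """Linearly interpolate matching parameter tensors.

    Shared language-backbone parameters are blended with weight
    `alpha` toward the LLM; parameters present only in the VLM
    (e.g. the vision encoder) are kept unchanged.
    """
    merged = {}
    for name, w_vlm in vlm_params.items():
        if name in llm_params and llm_params[name].shape == w_vlm.shape:
            merged[name] = (1 - alpha) * w_vlm + alpha * llm_params[name]
        else:
            merged[name] = w_vlm  # vision-only parameters stay as-is
    return merged

# Toy example: two shared language layers plus one vision-only tensor.
vlm = {"layer0.w": np.ones((2, 2)),
       "layer1.w": np.zeros((2, 2)),
       "vision.w": np.full((2, 2), 3.0)}
llm = {"layer0.w": np.zeros((2, 2)),
       "layer1.w": np.ones((2, 2))}
out = merge_parameters(vlm, llm, alpha=0.5)
```

Because the merge is training-free, the cost is a single pass over the parameter dictionaries; the choice of `alpha` controls how strongly the LLM's reasoning-tuned weights are injected.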

📝 Abstract
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities combine, and how each contributes, remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
Problem

Research questions and friction points this paper is trying to address.

Understanding how perception and reasoning combine in Vision-Language Models
Exploring model merging to transfer reasoning from LLMs to VLMs
Analyzing layer-wise contributions to perception and reasoning post-merging
Innovation

Methods, ideas, or system contributions that make the work stand out.

Merge models across modalities for reasoning transfer
Training-free reasoning transfer from LLMs to VLMs
Analyze layer contributions to perception and reasoning
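One way to probe layer-wise contributions, in the spirit of the analysis above, is to merge only a chosen band of layers and then measure perception and reasoning accuracy separately. The helper below is a hypothetical sketch (the `merge_layer_band` function and `layer{i}.w` naming are assumptions, not the paper's code): comparing a shallow-band merge against a deep-band merge would reveal which depths carry which capability.

```python
import numpy as np

def merge_layer_band(vlm_params, llm_params, layers, alpha=0.5):
    """Blend LLM weights into the VLM only at the given layer
    indices, leaving all other parameters untouched. Evaluating
    the model after each band isolates that band's contribution."""
    merged = dict(vlm_params)
    for idx in layers:
        name = f"layer{idx}.w"
        if name in llm_params:
            merged[name] = ((1 - alpha) * vlm_params[name]
                            + alpha * llm_params[name])
    return merged

# Toy 4-layer backbone: VLM weights filled with the layer index,
# LLM weights all zero, so merged bands are easy to spot.
vlm = {f"layer{i}.w": np.full((2, 2), float(i)) for i in range(4)}
llm = {f"layer{i}.w": np.zeros((2, 2)) for i in range(4)}

shallow = merge_layer_band(vlm, llm, layers=[0, 1])  # probe early layers
deep = merge_layer_band(vlm, llm, layers=[2, 3])     # probe middle-to-late layers
```

Sweeping the band across depths, and scoring each merged model on perception and reasoning benchmarks, would produce the kind of layer-wise attribution the paper reports: perception concentrated early, reasoning in middle-to-late layers.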
🔎 Similar Papers