Vision Function Layer in Multimodal LLMs

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study identifies that visual capabilities—such as counting, object localization, and OCR—in multimodal large language models (MLLMs) are not uniformly distributed across layers but are instead concentrated in specific decoder layers, termed “Visual Function Layers” (VFLs). These VFLs exhibit consistent depth ordering across diverse MLLMs and closely align with human visual cognitive hierarchies. To systematically characterize this phenomenon, we propose the Visual Token Swapping (VTS) framework, which quantifies layer-wise visual contribution via KV cache intervention and enables robust VFL identification. Building on this insight, we introduce two novel techniques: (1) VFL-LoRA, a parameter-efficient fine-tuning strategy that optimizes only VFLs, and (2) VFL-select, a data selection mechanism prioritizing samples most informative for VFL learning. Experiments demonstrate that VFL-LoRA surpasses full-LoRA fine-tuning in performance while preserving functional integrity; VFL-select achieves 98% of full-data performance using only 20% of the training data—substantially improving efficiency, interpretability, and robustness.
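The core Visual Token Swapping intervention can be illustrated in a few lines: replace the visual tokens' KV cache entries at a single decoder layer with those computed from a different image, then measure how much the decoded output shifts. The sketch below is a minimal, hypothetical simplification for illustration — the cache layout, `visual_token_swap`, `layer_contribution`, and `decode_fn` are assumptions of this example, not the authors' implementation.

```python
import numpy as np

def visual_token_swap(kv_cache_a, kv_cache_b, layer, visual_idx):
    """Return a copy of kv_cache_a in which the visual-token KV entries
    at one decoder layer are replaced by those from kv_cache_b.

    kv_cache_*: list of (K, V) array pairs, one per layer,
                each array of shape (seq_len, d).
    visual_idx: positions of the visual tokens in the sequence.
    """
    swapped = [(k.copy(), v.copy()) for k, v in kv_cache_a]
    k, v = swapped[layer]
    kb, vb = kv_cache_b[layer]
    k[visual_idx] = kb[visual_idx]
    v[visual_idx] = vb[visual_idx]
    return swapped

def layer_contribution(decode_fn, cache_a, cache_b, visual_idx):
    """Score each layer by how much swapping its visual KV entries
    changes the output of a (hypothetical) decode_fn that maps a
    full cache to an output vector."""
    base = decode_fn(cache_a)
    scores = []
    for layer in range(len(cache_a)):
        out = decode_fn(visual_token_swap(cache_a, cache_b, layer, visual_idx))
        scores.append(float(np.linalg.norm(out - base)))
    return scores
```

Layers whose swap produces the largest output change are the candidate Vision Function Layers for the probed capability.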

📝 Abstract
This study identifies that visual-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Typically, each function, such as counting, grounding, or OCR recognition, narrows down to two or three layers, which we define as Vision Function Layers (VFL). Additionally, the depths and ordering of different VFLs exhibit a consistent pattern across different MLLMs, well aligned with human behavior (e.g., recognition occurs first, followed by counting, and then grounding). These findings are derived from Visual Token Swapping, our novel analytical framework that modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. Furthermore, these insights offer substantial utility in tailoring MLLMs for real-world downstream applications. For instance, when LoRA training is selectively applied to VFLs whose functions align with the training data, VFL-LoRA not only outperforms full-LoRA but also prevents out-of-domain function forgetting. Moreover, by analyzing the performance differential on training data when particular VFLs are ablated, VFL-select automatically classifies data by function, enabling highly efficient data selection that directly bolsters the corresponding capabilities. Consequently, VFL-select surpasses human experts in data selection and achieves 98% of full-data performance with only 20% of the original dataset. This study delivers a deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.
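The VFL-LoRA idea described in the abstract can be sketched as attaching trainable low-rank adapters only to the identified Vision Function Layers while every other layer stays fully frozen. The `LoRALinear` class and `build_model` helper below are hypothetical simplifications under that assumption, not the paper's code.

```python
import numpy as np

class LoRALinear:
    """A frozen weight matrix with an optional trainable low-rank update:
    y = W x + A (B x). A minimal, illustrative stand-in for a LoRA layer."""
    def __init__(self, W, rank=0):
        self.W = W          # frozen base weight, shape (d_out, d_in)
        self.rank = rank
        if rank > 0:
            d_out, d_in = W.shape
            # Standard LoRA init: A small random, B zero,
            # so the adapter contributes nothing before training.
            self.A = np.random.randn(d_out, rank) * 0.01
            self.B = np.zeros((rank, d_in))

    def forward(self, x):
        y = self.W @ x
        if self.rank > 0:
            y = y + self.A @ (self.B @ x)
        return y

def build_model(weights, vfl_layers, rank=4):
    """Attach LoRA adapters only to the Vision Function Layers whose
    function matches the training data; all other layers get rank 0
    (i.e., remain frozen with no trainable parameters)."""
    return [LoRALinear(W, rank if i in vfl_layers else 0)
            for i, W in enumerate(weights)]
```

Restricting the adapters this way is what the abstract credits for avoiding out-of-domain function forgetting: layers serving unrelated functions are never updated.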
Problem

Research questions and friction points this paper is trying to address.

Identifies vision function layers for specific visual tasks in MLLMs
Reveals consistent layer processing patterns across different multimodal models
Develops methods to optimize model training and data selection efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Function Layers localize specific visual functions to a few decoder layers
Visual Token Swapping probes layer-wise functions via KV cache modification
VFL-LoRA selectively fine-tunes VFLs, improving performance while preventing function forgetting
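The VFL-select mechanism summarized above can be sketched as ranking training samples by their loss differential under VFL ablation: samples whose loss rises most when a VFL is disabled are the ones that exercise that layer's function, so they are the most informative to keep. Here `loss_fn` and `loss_fn_ablated` are hypothetical callables standing in for evaluating the model with and without the targeted layers.

```python
def vfl_select(samples, loss_fn, loss_fn_ablated, budget):
    """Select the `budget` samples with the largest loss increase when
    a Vision Function Layer is ablated. A minimal sketch of the
    selection criterion, assuming the two loss callables are given."""
    scored = [(loss_fn_ablated(s) - loss_fn(s), s) for s in samples]
    # Largest differential first: these samples depend most on the VFL.
    scored.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in scored[:budget]]
```

Under this criterion the selection is fully automatic, which is how the method can classify data by function without human labeling.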