🤖 AI Summary
This work investigates whether large language models (LLMs) inherently contain brain-like, distributed "visual regions" and proposes a parameter-efficient training paradigm for vision-language models (VLMs). Methodologically, it first identifies cross-layer, sparsely distributed visual-sensitive regions within LLMs via a novel task-driven visual response analysis. It then introduces a visual-region-guided hierarchical pruning and selective fine-tuning strategy, updating only 25% of LLM layers (uniformly distributed across depth) to achieve efficient optimization. Experiments demonstrate that the pruned models retain 99% of visual understanding capability across multiple architectures while text generation performance improves; training time is significantly reduced, model size is substantially compressed, and accuracy remains stable. The core contributions are: (i) empirical evidence of intrinsic visual structure in LLMs, and (ii) the first sparse fine-tuning and compression framework explicitly grounded in inter-layer visual sensitivity modeling.
📝 Abstract
Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous *visual region* within LLMs that functions as a cognitive core, and explore the potential of efficient training of LVLMs via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual-region-based pruning paradigm that removes non-critical layers outside the visual region with minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
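The core recipe described above — updating only 25% of LLM layers, spaced sparsely and uniformly across depth — can be sketched in a few lines. The function below is a minimal illustration, not the paper's implementation; the helper name `select_visual_region_layers` and the freezing loop over `model.layers` are hypothetical, assuming a standard decoder stack (e.g. the 32 layers of a Llama-3-8B backbone).

```python
def select_visual_region_layers(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Pick `fraction` of layer indices, spaced uniformly across depth.

    E.g. for a 32-layer backbone with fraction=0.25, this selects
    8 layers: indices 0, 4, 8, ..., 28.
    """
    k = max(1, round(num_layers * fraction))   # number of trainable layers
    stride = num_layers / k                    # uniform spacing across depth
    return [int(i * stride) for i in range(k)]


# Hypothetical usage with a PyTorch-style model: freeze everything outside
# the selected "visual region", keeping only those layers (and the projector)
# trainable during visual instruction tuning.
#
# trainable = set(select_visual_region_layers(len(model.layers)))
# for idx, layer in enumerate(model.layers):
#     for p in layer.parameters():
#         p.requires_grad = idx in trainable
```

The same index set can drive the pruning variant: layers outside the selected region are candidates for removal at inference time rather than merely being frozen during training.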