Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether large language models (LLMs) inherently contain brain-like, distributed "visual regions" and proposes a parameter-efficient training paradigm for vision-language models (VLMs). Methodologically, it first identifies cross-layer, sparsely distributed visual-sensitive regions within LLMs via a novel task-driven visual response analysis. It then introduces a visual-region-guided hierarchical pruning and selective fine-tuning strategy that updates only 25% of LLM layers, distributed uniformly across depth, for efficient optimization. Experiments across multiple architectures show that the pruned models retain 99% of visual understanding capability while text generation performance improves; training time is significantly reduced and model size is substantially compressed, with accuracy remaining stable. The core contributions are: (i) empirical evidence of intrinsic visual structure in LLMs, and (ii) the first sparse fine-tuning and compression framework explicitly grounded in inter-layer visual sensitivity modeling.

📝 Abstract
Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous visual region within LLMs that functions as a cognitive core, and explore the potential of efficient LVLM training via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual region-based pruning paradigm that removes non-critical layers outside the visual region with minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
Problem

Research questions and friction points this paper is trying to address.

Identify visual regions in LLMs for efficient training
Selectively update layers to maintain performance and reduce time
Propose pruning non-critical layers to minimize performance loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selectively updating sparse LLM layers
Visual region-based pruning paradigm (see the sketch after this list)
Activating layer-wise visual region
👥 Authors
Siyuan Wang
University of Southern California

Dianyi Wang
Fudan University & Shanghai Innovation Institute
Multi-modal Learning

Chengxing Zhou
Sun Yat-sen University

Zejun Li
Fudan University
vision-language, multi-modality

Zhihao Fan
Qwen Team; Fudan University
LVLM, Agent

Xuanjing Huang
Fudan University

Zhongyu Wei
Fudan University