🤖 AI Summary
This work investigates whether large language models (LLMs) inherently contain brain-like, distributed "visual regions" and proposes a parameter-efficient training paradigm for vision-language models (VLMs). Methodologically, it first identifies cross-layer, sparsely distributed visual-sensitive regions within LLMs via a novel task-driven visual response analysis. It then introduces a visual-region-guided hierarchical pruning and selective fine-tuning strategy, updating only 25% of LLM layers (uniformly distributed across depth) to achieve efficient optimization. Experiments demonstrate that the pruned models retain 99% of visual understanding capability across multiple architectures while text generation performance improves; training time is significantly reduced, model size is substantially compressed, and accuracy remains stable. The core contributions are: (i) empirical evidence of intrinsic visual structure in LLMs, and (ii) the first sparse fine-tuning and compression framework explicitly grounded in inter-layer visual sensitivity modeling.
📝 Abstract
Large Vision-Language Models (LVLMs) typically learn visual capacity through visual instruction tuning, involving updates to both a projector and their LLM backbones. Inspired by the concept of a visual region in the human brain, we investigate the existence of an analogous *visual region* within LLMs that functions as a cognitive core, and explore the potential of efficient training of LVLMs via selective layer tuning. Using Bunny-Llama-3-8B-V for detailed analysis and three other LVLMs for validation across diverse visual and textual tasks, we find that selectively updating 25% of LLM layers, when sparsely and uniformly distributed, can preserve nearly 99% of visual performance and maintain or improve textual task results, while effectively reducing training time. Based on this targeted training approach, we further propose a novel visual-region-based pruning paradigm that removes non-critical layers outside the visual region with minimal performance loss. This study offers an effective and efficient strategy for LVLM training and inference by activating a layer-wise visual region within LLMs, which proves consistently effective across different models.
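The core recipe described above — updating only 25% of LLM layers, spaced sparsely and uniformly across depth — can be sketched in a few lines. The function below is a minimal illustration, not the paper's implementation; the helper name `select_visual_region_layers` and the freezing loop over `model.layers` are hypothetical, assuming a standard decoder stack (e.g. the 32 layers of a Llama-3-8B backbone).

```python
def select_visual_region_layers(num_layers: int, fraction: float = 0.25) -> list[int]:
    """Pick `fraction` of layer indices, spaced uniformly across depth.

    E.g. for a 32-layer backbone with fraction=0.25, this selects
    8 layers: indices 0, 4, 8, ..., 28.
    """
    k = max(1, round(num_layers * fraction))   # number of trainable layers
    stride = num_layers / k                    # uniform spacing across depth
    return [int(i * stride) for i in range(k)]


# Hypothetical usage with a PyTorch-style model: freeze everything outside
# the selected "visual region", keeping only those layers (and the projector)
# trainable during visual instruction tuning.
#
# trainable = set(select_visual_region_layers(len(model.layers)))
# for idx, layer in enumerate(model.layers):
#     for p in layer.parameters():
#         p.requires_grad = idx in trainable
```

The same index set can drive the pruning variant: layers outside the selected region are candidates for removal at inference time rather than merely being frozen during training.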