LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

📅 2025-06-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing large vision-language models (LVLMs) suffer from inefficient vision-language fusion and high computational overhead, typically because they rely on visual token concatenation or extended context windows, which compromises linguistic priors and scalability. To address this, the paper proposes a **vision-conditioned dynamic modulation mechanism for layer normalization (LN)**: lightweight adapters inject visual information token-wise into LN's affine parameters, enabling cross-modal alignment without modifying the backbone architecture or extending the textual context. Combined with a multi-stage visual encoder, the resulting model achieves state-of-the-art performance across 15 image and video benchmarks. Compared to LLaVA-OV-7B, it reduces FLOPs by 94.0%, accelerates inference by 3.1x, halves GPU memory consumption, and enables real-time multimodal inference.
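The core mechanism can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the adapter here is a single linear map, and all names (`vision_conditioned_ln`, `W_gamma`, `W_beta`) and shapes are assumptions for the example; the paper's actual adapter design and how the per-token visual context is produced are described in the full text.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-5):
    """Standard LayerNorm over the last (hidden) dimension."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

def vision_conditioned_ln(h_text, v_ctx, W_gamma, W_beta, gamma, beta):
    """Sketch of the modulation idea: a lightweight adapter maps each
    token's visual context vector to deltas on LN's affine parameters,
    so visual information enters the LLM without concatenating visual
    tokens into the context."""
    d_gamma = v_ctx @ W_gamma   # token-wise delta on the LN scale
    d_beta = v_ctx @ W_beta     # token-wise delta on the LN shift
    return layer_norm(h_text, gamma + d_gamma, beta + d_beta)

# Toy shapes (assumed): 4 text tokens, hidden dim 8, visual context dim 6.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))           # linguistic hidden states
v = rng.normal(size=(4, 6))           # per-token visual context
Wg = 0.01 * rng.normal(size=(6, 8))   # lightweight adapter weights
Wb = 0.01 * rng.normal(size=(6, 8))
gamma, beta = np.ones(8), np.zeros(8)

out = vision_conditioned_ln(h, v, Wg, Wb, gamma, beta)
```

With zero adapter weights the deltas vanish and the layer reduces exactly to the LLM's original LayerNorm, which is why this modulation preserves the backbone's linguistic priors.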

📝 Abstract
Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient vision-language integration. Current methods either disrupt the model's inherent structure or introduce a severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Model (LLM). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1 times, and cuts memory usage in half, establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
Problem

Research questions and friction points this paper is trying to address.

Inefficient visual-language integration in LVLMs
Disruption of model structure or high computational burden
Need for scalable, efficient multimodal fusion solution
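The long-context burden named above is quadratic in sequence length, which is why avoiding visual token concatenation pays off so sharply. A back-of-envelope attention-cost comparison makes this concrete; all sizes here are illustrative assumptions (LLaVA-style 576 visual tokens, a 32-layer model with hidden size 4096), not the paper's measured numbers.

```python
def attn_flops(n_tokens, d_model, n_layers):
    """Rough self-attention cost: the QK^T and AV matmuls each take
    about 2 * N^2 * d multiply-adds per layer (projections ignored)."""
    return n_layers * 4 * n_tokens ** 2 * d_model

# Hypothetical sizes: 32-layer 7B-class LLM (d=4096),
# 576 visual tokens plus 128 text tokens.
concat_cost = attn_flops(576 + 128, 4096, 32)  # visual token concatenation
modulate_cost = attn_flops(128, 4096, 32)      # text-only context (LaVi-style)
print(f"attention cost ratio: {concat_cost / modulate_cost:.2f}x")
```

Under these toy assumptions the concatenation variant spends about 30x more on attention ((704/128)^2 = 30.25). This only counts attention; the paper's measured 94.0% end-to-end FLOP reduction versus LLaVA-OV-7B covers the whole model.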
Innovation

Methods, ideas, or system contributions that make the work stand out.

Internal feature modulation for vision-language fusion
Lightweight adaptive transformation reduces computational burden
Token-wise vision-conditioned deltas enhance alignment
👥 Authors
Tongtian Yue
Institute of Automation, Chinese Academy of Sciences
Multimodal Pretraining; Vision-Language
Longteng Guo
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yepeng Tang
Beijing Jiaotong University
Video LLM; Video Understanding
Zijia Zhao
Institute of Automation, Chinese Academy of Sciences (CASIA)
Multimodal Learning
Xinxin Zhu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Hua Huang
School of Artificial Intelligence, Beijing Normal University
Jing Liu
Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences