🤖 AI Summary
Existing large vision-language models (LVLMs) suffer from inefficient vision–language fusion and high computational overhead, often relying on visual token concatenation or extended context windows, which compromises linguistic priors and scalability. To address this, we propose a **vision-conditioned dynamic modulation mechanism for LayerNorm (LN) parameters** that injects visual information token-wise into LN's affine parameters via lightweight adapters, enabling cross-modal alignment without modifying the backbone architecture or extending the textual context. Our method integrates a multi-stage visual encoder and achieves state-of-the-art performance across 15 image and video benchmarks. Compared to LLaVA-OV-7B, it reduces FLOPs by 94.0%, accelerates inference by 3.1×, halves GPU memory consumption, and enables real-time multimodal inference.
📝 Abstract
Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient vision-language integration. Current methods either disrupt the model's inherent structure or introduce a severe long-context computational burden, severely limiting scalability and efficiency. In this paper, we rethink multimodal integration and present LaVi, a novel LVLM that enables seamless and efficient vision-language fusion through internal feature modulation within the Large Language Model (LLM). Unlike dominant LVLMs that rely on visual token concatenation, LaVi bypasses long-context expansion by introducing a lightweight and adaptive transformation, which incorporates visual context by injecting token-wise vision-conditioned deltas into the affine parameters of layer normalization. This mechanism directly modulates linguistic hidden states based on visual input, ensuring precise vision-language alignment while preserving the LLM's linguistic priors and drastically reducing computational costs. Extensive evaluations across 15 image and video benchmarks demonstrate that LaVi not only achieves state-of-the-art multimodal performance but also dramatically enhances efficiency. Compared to LLaVA-OV-7B, LaVi reduces FLOPs by 94.0%, improves inference speed by 3.1×, and cuts memory usage in half, establishing LaVi as a scalable and practical solution for real-time multimodal reasoning. The code and models will be released soon.
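To make the core idea concrete, here is a minimal NumPy sketch of a layer normalization whose affine parameters are shifted by vision-conditioned, token-wise deltas. This is an illustration of the general mechanism described above, not the authors' released implementation: the adapter is reduced to two linear projections (`W_g`, `W_b`), and the per-token visual context is simplified to a mean-pooled patch feature shared across tokens — all of these names and choices are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vis, n_tok, n_patch = 8, 6, 4, 5

# Base LayerNorm affine parameters, as they would exist in the frozen LLM.
gamma = np.ones(d_model)
beta = np.zeros(d_model)

# Hypothetical lightweight adapter: two small projections mapping visual
# features to token-wise deltas for the LN scale and shift.
W_g = rng.normal(0.0, 0.02, (d_vis, d_model))
W_b = rng.normal(0.0, 0.02, (d_vis, d_model))

def vision_conditioned_ln(h, v_ctx, eps=1e-5):
    """LayerNorm with vision-conditioned deltas on its affine parameters.

    h:     (n_tok, d_model) linguistic hidden states
    v_ctx: (n_tok, d_vis)   per-token visual context
    """
    mu = h.mean(-1, keepdims=True)
    var = h.var(-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)
    d_gamma = v_ctx @ W_g  # token-wise delta added to the scale
    d_beta = v_ctx @ W_b   # token-wise delta added to the shift
    return h_norm * (gamma + d_gamma) + (beta + d_beta)

h = rng.normal(size=(n_tok, d_model))            # text hidden states
patches = rng.normal(size=(n_patch, d_vis))      # visual encoder output
# Simplification: every text token sees the mean-pooled visual feature.
v_ctx = np.broadcast_to(patches.mean(0), (n_tok, d_vis))
out = vision_conditioned_ln(h, v_ctx)
print(out.shape)  # (4, 8)
```

Note that when the visual deltas are zero, the operation reduces exactly to the backbone's original LayerNorm, which is one way to see why the linguistic priors are preserved: the modulation is a perturbation around the unimodal behavior rather than a replacement of it.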