From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language models, which typically employ static, unidirectional connections that inject visual encoder outputs into the large language model at a single point, restricting deep integration of multi-granular visual semantics with linguistic reasoning. To overcome this, the authors propose the Cross-Layer Injection (CLI) framework, which introduces a dynamic many-to-many cross-modal connectivity paradigm. CLI enables the large language model to contextually and adaptively select and fuse visual features from all vision-encoder layers during decoding, facilitated by Adaptive Multi-Projection (AMP) and Adaptive Gating Fusion (AGF) mechanisms. Experiments integrating CLI into LLaVA-OneVision and LLaVA-1.5 demonstrate significant improvements across 18 benchmark datasets, validating its effectiveness and generalizability.

📝 Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
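The abstract describes two cooperating steps: AMP projects features from each vision-encoder layer into the LLM embedding space, and AGF gates those projected streams by their relevance to the current decoding context. The paper does not give implementation details, so the following is only a minimal sketch of that gated many-to-many fusion pattern; all dimensions, the mean-pooling step, and the similarity-based gating are illustrative assumptions, not the authors' actual design.

```python
import math
import random

random.seed(0)

# Toy dimensions; real models use far larger sizes (assumption).
D_VIS, D_LLM, N_LAYERS, N_TOKENS = 8, 12, 3, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(vec, weight):
    # (D_VIS,) @ (D_VIS, D_LLM) -> (D_LLM,)
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) for j in range(D_LLM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical features from every vision-encoder layer:
# N_LAYERS x N_TOKENS x D_VIS.
vis_feats = [[[random.gauss(0, 1) for _ in range(D_VIS)]
              for _ in range(N_TOKENS)]
             for _ in range(N_LAYERS)]

# AMP-like step: a separate projection per vision layer harmonizes
# layer-specific features into the LLM embedding space.
proj_weights = [rand_matrix(D_VIS, D_LLM) for _ in range(N_LAYERS)]
projected = [[project(tok, proj_weights[l]) for tok in vis_feats[l]]
             for l in range(N_LAYERS)]

# Current decoding-context vector h_t from the LLM
# (assumed available at each decoding step).
h_t = [random.gauss(0, 1) for _ in range(D_LLM)]

# AGF-like step: score each layer by the similarity of its mean-pooled
# representation to the decoding context, then normalize with softmax.
pooled = [[sum(tok[j] for tok in layer) / N_TOKENS for j in range(D_LLM)]
          for layer in projected]
gates = softmax([dot(p, h_t) / math.sqrt(D_LLM) for p in pooled])

# Fuse: gate-weighted sum across layers yields one visual stream per token,
# so the LLM receives context-dependent mixtures of the full visual hierarchy.
fused = [[sum(gates[l] * projected[l][t][j] for l in range(N_LAYERS))
          for j in range(D_LLM)]
         for t in range(N_TOKENS)]
```

Because the gates are recomputed from `h_t` at every decoding step, different tokens can draw on different depths of the visual hierarchy, which is the "dynamic many-to-many" property the abstract emphasizes.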
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
visual feature bottleneck
hierarchical visual knowledge
multimodal alignment
static architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Injection
Vision-Language Models
Adaptive Multi-Projection
Adaptive Gating Fusion
Dynamic Multimodal Fusion