Sparse Shortcuts: Facilitating Efficient Fusion in Multimodal Large Language Models

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models typically rely solely on high-level visual features, neglecting mid- and low-level semantic information, which limits their cross-modal understanding capabilities. This work proposes SparseCut, a novel architecture that introduces, for the first time, a sparse shortcut connection mechanism to enable efficient hierarchical fusion of multi-level visual features between the cross-modal encoder and the large language model. Additionally, it incorporates a multi-granularity feature fusion module that enhances semantic alignment without increasing input sequence length or computational overhead. The proposed approach effectively balances fusion depth and efficiency, is compatible with diverse foundation large language models, and achieves significant performance gains across multiple multimodal benchmarks, demonstrating strong generalizability and scalability.

📝 Abstract
With the remarkable success of large language models (LLMs) in natural language understanding and generation, multimodal large language models (MLLMs) have rapidly advanced in their ability to process data across multiple modalities. While most existing efforts focus on scaling up language models or constructing higher-quality training data, limited attention has been paid to effectively integrating cross-modal knowledge into the language space. In vision-language models, for instance, aligning modalities using only high-level visual features often discards the rich semantic information present in mid- and low-level features, limiting the model's cross-modal understanding. To address this issue, we propose SparseCut, a general cross-modal fusion architecture for MLLMs that introduces sparse shortcut connections between the cross-modal encoder and the LLM. These shortcut connections enable efficient, hierarchical integration of visual features at multiple levels, facilitating richer semantic fusion without increasing computational overhead. We further introduce an efficient multi-grained feature fusion module, which fuses visual features before routing them through the shortcuts. This preserves the original language context and does not increase the overall input length, thereby avoiding additional computational complexity for the LLM. Experiments demonstrate that SparseCut significantly enhances the performance of MLLMs across various multimodal benchmarks, with generality and scalability across different base LLMs.
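The abstract describes two mechanisms: multi-grained fusion of multi-level visual features, and sparse shortcut connections that inject the fused result into a subset of LLM layers without appending extra tokens. A minimal NumPy sketch of that idea follows; all names, dimensions, layer choices, and fusion weights are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper)
seq_len, d_model = 16, 64        # LLM token count and hidden size
num_levels = 3                   # low-, mid-, and high-level visual features
num_llm_layers = 8
shortcut_layers = {2, 5}         # sparse subset of LLM layers receiving shortcuts

# Multi-level visual features from the cross-modal encoder,
# assumed already projected to the LLM token count and hidden size.
visual_feats = [rng.standard_normal((seq_len, d_model)) for _ in range(num_levels)]

def fuse_multigrained(feats):
    """Toy multi-granularity fusion: a softmax-weighted sum over feature
    levels (stand-in for a learned module). The output keeps the
    (seq_len, d_model) shape, so the LLM input length is unchanged."""
    logits = np.zeros(len(feats))            # stand-in for learned level weights
    weights = np.exp(logits) / np.exp(logits).sum()
    return sum(w * f for w, f in zip(weights, feats))

fused = fuse_multigrained(visual_feats)

def llm_layer(h):
    """Stand-in for a transformer block (residual + small nonlinearity)."""
    return h + 0.1 * np.tanh(h)

# Forward pass: the fused visual features are injected only at the sparse
# shortcut layers, via residual addition rather than extra input tokens.
h = rng.standard_normal((seq_len, d_model))  # stand-in language hidden states
for layer_idx in range(num_llm_layers):
    h = llm_layer(h)
    if layer_idx in shortcut_layers:
        h = h + fused                        # sparse shortcut injection

print(h.shape)  # sequence length and hidden size are unchanged
```

The key property the sketch illustrates is that injection happens by addition inside the layer stack, so the LLM's sequence length, and hence its attention cost, stays constant regardless of how many feature levels are fused.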
Problem

Research questions and friction points this paper is trying to address.

multimodal large language models
cross-modal fusion
visual features
semantic integration
vision-language alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Shortcuts
Multimodal Fusion
Hierarchical Feature Integration
Efficient MLLMs
Cross-modal Alignment