ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost incurred by current large vision-language models, which uniformly apply self-attention over both visual and textual tokens in every Transformer layer. To mitigate this inefficiency, the authors propose ViCA, a novel architecture that leverages the inherent alignment between visual embeddings and the language space. In ViCA, visual tokens bypass all self-attention and feed-forward layers, interacting with textual tokens only through sparse cross-attention at a few selected layers. This design drastically reduces visual-side computation while enabling hardware-friendly, efficient inference and remaining orthogonal to token pruning techniques. Evaluated across three backbone models and nine benchmarks, ViCA achieves 98% of the original accuracy using only 4% of the visual computation, yielding over 3.5× speedup in single-batch inference and more than 10× acceleration in multi-batch settings.
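As a rough illustration of why removing visual tokens from the dense stack saves compute, the sketch below counts approximate FLOPs per layer. It uses common back-of-the-envelope estimates for a Transformer layer: four projection matmuls, two attention matmuls, and a feed-forward block with 4x expansion. The function names, the assumption that cross-attention applies key/value projections to visual tokens, and every size in the usage note are hypothetical and not taken from the paper, so the result should not be expected to reproduce the reported 4% figure.

```python
def dense_layer_flops(n, d):
    """Approximate FLOPs for one Transformer layer over n tokens of width d."""
    proj = 8 * n * d * d   # Q, K, V, O projections: 4 matmuls of 2*n*d*d each
    attn = 4 * n * n * d   # QK^T and attention-weighted sum of V (no causal discount)
    ffn = 16 * n * d * d   # two matmuls with a 4x hidden expansion
    return proj + attn + ffn


def cross_attention_flops(t, v, d):
    """Approximate FLOPs for t text queries attending to v visual tokens."""
    q_o = 4 * t * d * d    # query and output projections on text tokens
    k_v = 4 * v * d * d    # key and value projections on visual tokens (assumed)
    attn = 4 * t * v * d   # scores plus weighted sum
    return q_o + k_v + attn


def visual_cost_ratio(t, v, d, num_layers, num_cross_layers):
    """Visual-side cost of sparse cross-attention relative to dense processing."""
    # Dense baseline: extra cost of carrying v visual tokens through every layer.
    dense_visual = num_layers * (dense_layer_flops(t + v, d) - dense_layer_flops(t, d))
    # Cross-attention variant: visual tokens only cost something at selected layers.
    sparse_visual = num_cross_layers * cross_attention_flops(t, v, d)
    return sparse_visual / dense_visual
```

For example, `visual_cost_ratio(64, 576, 4096, 32, 4)` estimates the ratio for 64 text tokens, 576 visual tokens, hidden width 4096, 32 layers, and 4 cross-attention layers; all of these values are illustrative only.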

📝 Abstract
Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity of such dense visual processing and show that projected visual embeddings are already well-aligned with the language space, while effective vision-language interaction occurs in only a small subset of layers. Based on these insights, we propose ViCA (Vision-only Cross-Attention), a minimal MLLM architecture in which visual tokens bypass all self-attention and feed-forward layers, interacting with text solely through sparse cross-attention at selected layers. Extensive evaluations across three MLLM backbones, nine multimodal benchmarks, and 26 pruning-based baselines show that ViCA preserves 98% of baseline accuracy while reducing visual-side computation to 4%, consistently achieving superior performance-efficiency trade-offs. Moreover, ViCA provides a regular, hardware-friendly inference pipeline that yields over 3.5x speedup in single-batch inference and over 10x speedup in multi-batch inference, reducing visual grounding to near-zero overhead compared with text-only LLMs. It is also orthogonal to token pruning methods and can be seamlessly combined for further efficiency gains. Our code is available at https://github.com/EIT-NLP/ViCA.
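The data flow described in the abstract can be sketched in a few lines of plain Python. This is a minimal toy, not the authors' implementation: it is single-head, omits learned projections and normalization, and uses a `tanh` residual as a stand-in for the feed-forward block. The names `attend`, `forward`, and `cross_layers` are hypothetical. What it does show is the structural idea: text tokens pass through every layer, visual tokens are never updated, and text reads from them only at the selected layers.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        limit = i + 1 if causal else len(keys)  # causal: only positions <= i
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[:limit]]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values[:limit]))
                    for j in range(d)])
    return out


def add(xs, ys):
    """Residual connection: element-wise sum of two token sequences."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def forward(text, visual, num_layers, cross_layers):
    """Update text tokens layer by layer; visual tokens stay fixed throughout."""
    for layer in range(num_layers):
        # Self-attention runs over text tokens only.
        text = add(text, attend(text, text, text, causal=True))
        # Sparse cross-attention: text queries read visual keys/values.
        if layer in cross_layers:
            text = add(text, attend(text, visual, visual))
        # Stand-in for the feed-forward block (text tokens only).
        text = add(text, [[math.tanh(x) for x in tok] for tok in text])
    return text


# Toy usage with random embeddings: 3 text tokens, 5 visual tokens, width 4.
rng = random.Random(0)
text_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
visual_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
hidden = forward(text_tokens, visual_tokens, num_layers=4, cross_layers={1})
```

Because the visual sequence is read-only, its cost is confined to the layers listed in `cross_layers`; with an empty set the loop reduces to a text-only model.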
Problem

Research questions and friction points this paper is trying to address.

multimodal LLMs
computational overhead
vision-language interaction
efficient inference
visual processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-only Cross-Attention
Multimodal LLMs
Efficient Inference
Sparse Cross-Attention
Computation Reduction
Wenjie Liu
Harbin Institute of Technology
Spectral method, Singularity, Nonsmooth domain, hp-FEM
Hao Wu
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Xin Qiu
Cognizant AI Labs
Neural Architecture Search, Uncertainty Quantification, Evolutionary Computation
Yingqi Fan
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Yihan Zhang
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Anhao Zhao
Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative, Institute of Digital Twin, Eastern Institute of Technology, Ningbo
Yunpu Ma
Ludwig Maximilian University of Munich
Foundation Models, Agentic AI, Temporal Knowledge Graph, Quantum AI
Xiaoyu Shen
Eastern Institute of Technology, Ningbo
language model, multi-modal learning, reasoning