CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion

📅 2025-12-22
🤖 AI Summary
To address the excessive memory and computational overhead incurred by full-token insertion in vision-language models (VLMs) for high-resolution images, long conversations, and streaming video, this paper proposes a lightweight fusion paradigm. Specifically, it is the first to introduce local text self-attention into cross-attention layers, enabling efficient co-modeling of visual and linguistic features. The approach keeps the visual encoder decoupled from the language model, significantly reducing GPU memory consumption and inference latency while mitigating the degradation in fine-grained visual understanding commonly observed in pure cross-attention designs. Experiments demonstrate that the method matches full-token insertion on mainstream image understanding benchmarks. Moreover, it enables scalable, high-throughput, low-latency deployment for long-context tasks such as streaming video captioning, effectively balancing efficiency and accuracy.

📝 Abstract
Vision-language models (VLMs) are commonly trained by inserting image tokens from a pretrained vision encoder into the textual stream of a language model. This allows text and image information to fully attend to one another within the model, but becomes extremely costly for high-resolution images, long conversations, or streaming videos, both in memory and compute. VLMs leveraging cross-attention are an efficient alternative to token insertion but exhibit a clear performance gap, in particular on tasks involving fine-grained visual details. We find that a key to improving such models is to also enable local text-to-text interaction in the dedicated cross-attention layers. Building on this, we propose CASA, Cross-Attention via Self-Attention, a simple and efficient paradigm which substantially reduces the gap with full token insertion on common image understanding benchmarks, while enjoying the same scalability as cross-attention models when applied to long-context multimodal tasks such as streaming video captioning. For samples and code, please see our project page at https://kyutai.org/casa .
Problem

Research questions and friction points this paper is trying to address.

Improves cross-attention VLMs for fine-grained visual tasks
Reduces computational cost of high-resolution multimodal fusion
Enables efficient long-context video processing like streaming captioning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention via self-attention for vision-language fusion
Enables local text-to-text interaction in cross-attention layers
Reduces performance gap with full token insertion efficiently
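The core idea above can be sketched in code. The following is a minimal, single-head illustration (not the authors' implementation; all names and the exact masking scheme are assumptions): each text query attends in one softmax over all image keys plus text keys inside a local causal window, which is one plausible reading of "local text-to-text interaction in the dedicated cross-attention layers."

```python
# Hedged sketch of a CASA-style layer: text queries attend jointly to
# all image tokens AND to nearby text tokens, via one softmax over the
# concatenated key set. Single head, no projections, for clarity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def casa_attention(q_text, k_text, v_text, k_img, v_img, window=4):
    """q_text/k_text/v_text: (T, d); k_img/v_img: (N, d). Returns (T, d)."""
    T, d = q_text.shape
    N = k_img.shape[0]
    k = np.concatenate([k_img, k_text], axis=0)     # (N + T, d)
    v = np.concatenate([v_img, v_text], axis=0)
    scores = q_text @ k.T / np.sqrt(d)              # (T, N + T)
    # Mask: every image token is visible to every text query; text keys
    # are visible only inside a causal local window [i - window + 1, i].
    mask = np.full((T, N + T), -np.inf)
    mask[:, :N] = 0.0
    for i in range(T):
        lo = max(0, i - window + 1)
        mask[i, N + lo : N + i + 1] = 0.0
    return softmax(scores + mask) @ v
```

Because text keys outside the window are masked out, the text-side cost stays linear in sequence length with a constant window, while visual tokens never enter the language model's own self-attention stream, which is what makes the scheme attractive for long-context settings like streaming video.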