RAVE: Re-Allocating Visual Attention in Large Multimodal Models

2026-05-18

AI Summary
This work addresses two attention-allocation problems observed in large multimodal models that use standard self-attention: cross-modal misallocation between visual and textual tokens, and imbalanced attention among visual tokens. To mitigate them, the authors propose RAVE, a lightweight pair-gating mechanism that adds a learned bias, computed from pre-RoPE query and key features, to the attention scores before the softmax, so that attention over visual tokens is redistributed dynamically. RAVE requires no modification to the backbone architecture and supports end-to-end training, with the aim of improving visual grounding. Across multiple multimodal benchmarks it improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks such as multilingual OCR, chart understanding, and document visual question answering.
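The mechanism summarized above can be written compactly. The notation below is introduced here for illustration and is not taken from the paper; in particular, the form of the learned function $g_\theta$ is not specified in this summary and is left abstract.

```latex
% \tilde{Q}, \tilde{K}: queries and keys after RoPE (used for the standard scores)
% q_i, k_j: the corresponding pre-RoPE features (used only for the bias)
% M_{ij} = 1 if key j is a visual token, 0 otherwise
% g_\theta: a learned pairwise function (form assumed unknown here)
\[
A = \operatorname{softmax}\!\left(
      \frac{\tilde{Q}\tilde{K}^{\top}}{\sqrt{d}} + M \odot B
    \right),
\qquad
B_{ij} = g_\theta(q_i, k_j)
\]
```

Because the bias enters before the softmax and only on visual-key columns, it can shift attention mass toward or away from individual visual tokens while leaving the relative weights among text keys unchanged.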
πŸ“ Abstract
Large multimodal models (LMMs) inherit the self-attention mechanism of pretrained language backbones, yet standard attention can exhibit suboptimal allocation, including cross-modal misallocation between textual and visual evidence and intra-visual imbalance among visual tokens. We propose RAVE (Re-Allocating Visual Attention), a lightweight pair-gating mechanism that adds a learned query-key bias to pre-softmax attention scores over visual keys, derived from pre-RoPE query and key features. RAVE requires no architectural modification to the backbone and can be trained end-to-end with the rest of the model. Across a suite of multimodal benchmarks, RAVE improves over standard attention by an average of 3 points, with the largest gains on perception-intensive tasks (including multilingual OCR, chart understanding, document VQA, and scene text VQA) where accurate visual grounding is critical.
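A minimal sketch of the idea in the abstract, using only the Python standard library. It is not the authors' implementation: the abstract does not say how the query-key bias is parameterized, so the low-rank bilinear form with the projections `w_q` and `w_k` is an assumption made here, as are all function and variable names. The causal mask and multi-head structure are omitted, and the demo reuses the same features as both pre-RoPE and post-RoPE inputs, since no rotation is applied.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_visual_bias(q_rope, k_rope, q_pre, k_pre, w_q, w_k, is_visual):
    """Attention weights with a learned pairwise bias on visual key columns.

    q_rope, k_rope: position-encoded queries/keys used for the standard scores.
    q_pre, k_pre:   the same features before RoPE, used only for the bias.
    w_q, w_k:       hypothetical (d x r) projections defining the bias (assumed form).
    is_visual:      one bool per key; the bias is added only where it is True.
    """
    d = len(q_rope[0])
    scores = matmul(q_rope, transpose(k_rope))  # (n_q x n_k) raw dot products
    # Assumed bias form: B = (Q_pre W_q)(K_pre W_k)^T, one scalar per query-key pair.
    bias = matmul(matmul(q_pre, w_q), transpose(matmul(k_pre, w_k)))
    weights = []
    for i, row in enumerate(scores):
        logits = [
            s / math.sqrt(d) + (bias[i][j] if is_visual[j] else 0.0)
            for j, s in enumerate(row)
        ]
        weights.append(softmax(logits))
    return weights


# Toy example: 5 tokens, the first 3 are visual, the last 2 are text.
rng = random.Random(0)


def rand(n_rows, n_cols):
    return [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]


n, d, r = 5, 4, 2
q, k = rand(n, d), rand(n, d)
is_visual = [True, True, True, False, False]
w_q, w_k = rand(d, r), rand(d, r)
weights = attention_with_visual_bias(q, k, q, k, w_q, w_k, is_visual)
```

Setting `w_q` and `w_k` to zero recovers standard attention in this sketch, which is one way to see that such a bias can be attached to an existing backbone without changing its architecture.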
Problem

Research questions and friction points this paper is trying to address.

visual attention
cross-modal misallocation
intra-visual imbalance
multimodal models
attention allocation
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual attention reallocation
multimodal models
pair-gating mechanism
cross-modal alignment
visual grounding