🤖 AI Summary
Existing vision-language models (VLMs) suffer from redundant cross-modal attention, weak semantic locality, and poor modality alignment—stemming from naive concatenation of image and text tokens and modality-agnostic positional encoding—leading to hallucination and performance degradation. To address this, we propose **cross-modal joint grouping** and a **parameter-free attention anchoring mechanism**: for the first time, visual and textual tokens are jointly clustered into shared semantic groups; within each group, text tokens semantically associated with visual patches serve as lightweight, interpretable semantic landmarks that guide native attention to model authentic cross-modal dependencies. Our method introduces no additional parameters, seamlessly integrates with pretrained large language model (LLM) architectures, and preserves inference efficiency. Evaluated across 15 benchmarks, it achieves significant improvements on 13 tasks, with up to 32% gain on reasoning and up to 15% reduction in hallucination rate. Notably, TinyLLaVA-1B—despite its minimal computational overhead—outperforms substantially larger models.
📝 Abstract
A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today's vision language models (VLMs) hallucinate and underperform pure language models is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding, which conveniently adopts the pretrained LLM backbone but forces unnecessary long-distance attention between semantically related tokens across modalities. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores, guiding the model to focus on the correct image regions for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt's semantic flow. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks. AttAnchor enables TinyLLaVA 1B to outperform much larger models like LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.