Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

📅 2026-04-01
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that multimodal large language models often struggle to accurately localize relevant image regions and effectively leverage textual evidence when answering knowledge-intensive visual questions. To this end, the authors propose a training-free, architecture-agnostic inference-time framework that identifies critical visual and textual evidence by analyzing the model's internal attention mechanisms. Key evidence is dynamically reweighted through lightweight prompt tokens to guide the generation process toward highly relevant information. This approach provides the first demonstration of dynamic highlighting of multimodal evidence, significantly outperforming zero-shot baselines across multiple knowledge-based visual question answering benchmarks. Notably, even without external textual input, performance improves and hallucination is reduced solely through visual evidence highlighting.
πŸ“ Abstract
Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
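The abstract describes a two-step inference-time procedure: score candidate evidence units (image regions, retrieved passages) by the model's attention mass, then mark the top-scoring units in the prompt so the model re-attends to them during generation. A minimal sketch of that idea is below; the `keep_ratio` parameter, the `<evidence>` markers, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def select_evidence(attn, keep_ratio=0.2):
    """Return the indices of the evidence units with the highest
    attention mass.

    attn: 1-D array of attention weights the model assigned to
    candidate evidence units (e.g. image patches or retrieved
    passages). keep_ratio controls how many units are kept.
    """
    k = max(1, int(len(attn) * keep_ratio))      # keep at least one unit
    order = np.argsort(attn)[::-1]               # descending by attention
    return sorted(order[:k].tolist())

def highlight_prompt(question, passages, keep_idx):
    """Wrap the selected passages in lightweight prompt-level
    markers so a second generation pass re-attends to them."""
    marked = [
        f"<evidence>{p}</evidence>" if i in keep_idx else p
        for i, p in enumerate(passages)
    ]
    return question + "\n" + "\n".join(marked)

# Toy example with five retrieved passages and made-up attention scores.
attn = np.array([0.05, 0.40, 0.10, 0.35, 0.10])
passages = ["p0", "p1", "p2", "p3", "p4"]
idx = select_evidence(attn, keep_ratio=0.4)       # -> [1, 3]
prompt = highlight_prompt("Q: ...?", passages, idx)
```

In the paper's training-free setting, such a highlighted prompt would simply be fed back to the same frozen MLLM for the final answer; no weights are updated.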
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
visual evidence highlighting
knowledge-intensive VQA
retrieved textual evidence
fine-grained visual localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

training-free
evidence highlighting
multimodal large language models
attention-based selection
visual question answering