🤖 AI Summary
Existing MKB-VQA benchmarks suffer from "visual shortcut" biases: models answer correctly by exploiting strong visual associations between the query image and the target document's primary entity, bypassing genuine multimodal reasoning. Method: We propose RETINA, a shortcut-free benchmark whose image-text pairs are anchored on secondary (related) entities rather than primary ones, thereby enforcing genuine cross-modal alignment; it is constructed automatically with an LLM-driven pipeline and comprises 120k training examples and a 2k human-curated test set. We also propose MIMIR, a Multi-Image Multimodal Retriever that enriches document embeddings with images of multiple related entities, in contrast to prior retrievers that use only a single image per document. Contribution/Results: Mainstream models show substantial performance drops on RETINA, confirming their reliance on visual shortcuts, while MIMIR significantly outperforms baselines, demonstrating improved robustness and deeper multimodal understanding. RETINA thus provides a more rigorous evaluation framework for knowledge-grounded VQA, and MIMIR shows how multi-image grounding improves retrieval fidelity.
📝 Abstract
Existing Multimodal Knowledge-Based Visual Question Answering (MKB-VQA) benchmarks suffer from "visual shortcuts", as the query image typically matches the primary subject entity of the target document. We demonstrate that models can exploit these shortcuts, achieving comparable results using visual cues alone. To address this, we introduce the Relational Entity Text-Image kNowledge Augmented (RETINA) benchmark, automatically constructed with an LLM-driven pipeline and consisting of a 120k training set and a 2k human-curated test set. RETINA contains queries referencing secondary subjects (i.e., related entities) and pairs them with images of these related entities, removing the visual shortcut. When evaluated on RETINA, existing models show significantly degraded performance, confirming their reliance on the shortcut. Furthermore, we propose the Multi-Image MultImodal Retriever (MIMIR), which enriches document embeddings by augmenting them with images of multiple related entities, effectively handling RETINA, unlike prior work that uses only a single image per document. Our experiments validate the limitations of existing benchmarks and demonstrate the effectiveness of RETINA and MIMIR. Our project is available at: Project Page.