🤖 AI Summary
This work addresses the limitations of existing knowledge-based visual question answering (KB-VQA) approaches, which rely on rigid retrieve-then-generate pipelines that struggle to handle diverse questions and suffer from misalignment between retrieved evidence and the query, owing to the disconnect between retrieval and reasoning. To overcome these issues, the paper introduces an agent framework that formulates KB-VQA as a multi-step decision-making process: at each step, the agent dynamically selects an action (answering directly, retrieving images or text, or generating a description), thereby unifying retrieval and reasoning. By designing a structured action space, automatically collecting multi-step reasoning trajectories, and applying instruction tuning on them, the proposed method achieves state-of-the-art performance on the InfoSeek and E-VQA benchmarks, significantly outperforming existing baselines.
📝 Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
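The multi-step loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the action names follow the abstract, but the policy, the tool stubs, and all function names (`choose_action`, `run_agent`) are hypothetical placeholders for the fine-tuned VLM and retrieval backends.

```python
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION = "caption"

def choose_action(state):
    # In the paper, a fine-tuned VLM picks the next action from the
    # current information state; here a fixed heuristic stands in.
    if state["caption"] is None:
        return Action.CAPTION
    if not state["evidence"]:
        return Action.TEXT_RETRIEVAL
    return Action.ANSWER

def run_agent(question, image, max_steps=6):
    state = {"question": question, "image": image,
             "caption": None, "evidence": [], "trajectory": []}
    for _ in range(max_steps):
        action = choose_action(state)
        state["trajectory"].append(action.value)  # record for supervision
        if action is Action.ANSWER:
            return "<answer from VLM>", state["trajectory"]
        elif action is Action.CAPTION:
            state["caption"] = "<caption of image>"    # stub captioner
        elif action is Action.TEXT_RETRIEVAL:
            state["evidence"].append("<KB passage>")   # stub retriever
        elif action is Action.IMAGE_RETRIEVAL:
            state["evidence"].append("<similar image>")  # stub retriever
    return "<fallback answer>", state["trajectory"]

answer, traj = run_agent("Who designed this landmark?", image=None)
print(traj)  # → ['caption', 'text_retrieval', 'answer']
```

The recorded `trajectory` list mirrors the paper's idea of logging reasoning steps and tool calls, which are then used as fine-tuning supervision.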