Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing knowledge-based visual question answering (KB-VQA) approaches, which rely on rigid retrieve-then-generate pipelines that struggle with diverse question types and produce evidence poorly aligned with the query, a consequence of the disconnect between retrieval and reasoning. To overcome these issues, the paper introduces an agent framework that formulates KB-VQA as a multi-step decision-making process: at each step, the agent dynamically selects an action (answering directly, retrieving images or text, or generating a description), thereby unifying retrieval and reasoning. By designing a structured action space, automatically collecting multi-step reasoning trajectories, and applying instruction tuning, the proposed method achieves state-of-the-art performance on the InfoSeek and E-VQA benchmarks, significantly outperforming existing baselines.
📝 Abstract
Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, or Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
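The multi-step decision loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `policy` and `tools` interfaces, and the toy stand-ins are all hypothetical, while the four-action space (Answer, Image Retrieval, Text Retrieval, Caption) comes from the paper.

```python
from enum import Enum

class Action(Enum):
    """The paper's four-action space (member names are illustrative)."""
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION = "caption"

def run_agent(question, image, policy, tools, max_steps=5):
    """Multi-step decision loop: the policy picks one action per step;
    non-terminal actions call a tool and extend the evidence state,
    while Answer terminates the episode."""
    state = {"question": question, "image": image, "evidence": []}
    trajectory = []  # (action, argument) pairs, usable as supervision
    for _ in range(max_steps):
        action, arg = policy(state)
        trajectory.append((action, arg))
        if action is Action.ANSWER:
            return arg, trajectory
        state["evidence"].append(tools[action](arg))
    return None, trajectory  # step budget exhausted without an answer

# Toy stand-ins for the learned policy and the retrieval tools.
def toy_policy(state):
    if not state["evidence"]:
        return Action.TEXT_RETRIEVAL, state["question"]
    return Action.ANSWER, state["evidence"][-1]

toy_tools = {Action.TEXT_RETRIEVAL: lambda query: "France"}
```

In this sketch the recorded trajectory doubles as the supervision signal: each `(action, argument)` pair is exactly the kind of intermediate decision the paper's automated pipeline collects for instruction tuning.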
Problem

Research questions and friction points this paper is trying to address.

Knowledge-based Visual Question Answering
Retrieval-Augmented Generation
Search-Agent
Multi-step Decision Making
Evidence Alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Search Agent
Multi-step Decision Making
Retrieval-Augmented Generation
Knowledge-based VQA
Dynamic Reasoning
Zhuohong Chen
Tsinghua University
Zhenxian Wu
Tsinghua University
Yunyao Yu
Tsinghua University
Hangrui Xu
Hefei University of Technology
Zirui Liao
Tsinghua University
Zhifang Liu
School of Mathematical Sciences, Tianjin Normal University
Xiangwen Deng
University of Arizona
Pen Jiao
Tsinghua University
Haoqian Wang
Tsinghua University