🤖 AI Summary
This work proposes the first learnable multimodal planning agent for visual question answering (VQA) that addresses the computational redundancy and inefficiency of existing fixed, multi-stage multimodal retrieval-augmented generation (mRAG) pipelines. Trained with reinforcement learning, the agent dynamically prunes the mRAG process, determining whether each step is necessary and invoking tools only when needed. Across six VQA benchmarks, the method outperforms existing baselines on average, cutting inference time by over 60% and significantly reducing costly tool calls, while maintaining or even improving answer accuracy — thereby breaking the performance–efficiency trade-off inherent in static pipelines.
📝 Abstract
Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by supplying additional evidence on both the image and text sides, the default procedure for answering VQA queries, especially knowledge-intensive ones, typically relies on a fixed multi-stage mRAG pipeline with inherent stage dependencies. To mitigate this inefficiency while maintaining task performance, this paper trains a multimodal planning agent that dynamically decomposes the mRAG pipeline when solving VQA. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to determine whether each mRAG step is necessary. In our experiments, the agent reduces redundant computation, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Moreover, our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average across six diverse datasets. Code will be released.