🤖 AI Summary
To address low-quality multimodal queries and weak retrieval relevance in knowledge-base-augmented visual question answering (KB-VQA), this paper proposes Wiki-PRF, a three-stage framework. First, it dynamically invokes visual tools to generate high-fidelity multimodal queries. Second, it fuses visual and textual features for precise knowledge retrieval. Third, it applies a learnable relevance filtering mechanism to improve the reliability of retrieved results. The underlying vision-language model is trained end to end with reinforcement learning, using answer accuracy and output-format consistency as reward signals. On E-VQA and InfoSeek, Wiki-PRF improves answer quality by 36.0 and 42.8 points, respectively, setting a new state of the art. Key innovations include dynamic multimodal query generation, cross-modal retrieval feature fusion, and the learnable relevance filtering module.
📝 Abstract
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances on this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval, and Filtering stages. The Processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The Retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The Filtering stage performs relevance filtering and condensation of the retrieval results. To this end, we introduce a visual language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8 points) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
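The three-stage pipeline and the RL reward described above can be sketched in miniature. This is a hypothetical illustration only: all function names, the keyword-overlap retrieval, and the 0.1 format bonus are assumptions, not the authors' implementation (which uses a VLM, visual tools, and learned multimodal features).

```python
# Illustrative sketch of the Processing -> Retrieval -> Filtering flow.
# Names, scoring, and reward weights are assumptions for exposition.

def process(image, question):
    """Processing: invoke visual tools to build a multimodal query.
    Here we just pass the question text through; the real system would
    add tool outputs such as captions or detected entities."""
    return {"text_query": question, "visual_query": image}

def retrieve(query, knowledge_base):
    """Retrieval: stand-in for fused visual+text feature matching,
    implemented here as naive keyword overlap."""
    tokens = query["text_query"].lower().split()
    return [doc for doc in knowledge_base if any(t in doc for t in tokens)]

def filter_relevant(candidates, question, keep=2):
    """Filtering: keep only the passages most relevant to the question."""
    tokens = question.lower().split()
    scored = sorted(candidates,
                    key=lambda d: sum(t in d for t in tokens),
                    reverse=True)
    return scored[:keep]

def reward(prediction, gold, well_formatted):
    """RL reward: answer accuracy plus a format-consistency bonus
    (the 0.1 weight is an assumed value)."""
    acc = float(prediction.strip().lower() == gold.strip().lower())
    return acc + 0.1 * float(well_formatted)

kb = ["the eiffel tower is in paris", "mount fuji is in japan"]
q = "Where is the Eiffel Tower?"
docs = filter_relevant(retrieve(process(None, q), kb), q)
print(docs[0])                          # most relevant passage
print(reward("Paris", "paris", True))   # accuracy 1.0 + format bonus 0.1
```

In the actual framework these stages are not hand-coded heuristics: the VLM itself decides which tools to call and which retrieved passages to keep, and the combined reward drives that behavior during RL training.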