🤖 AI Summary
To address low-quality multimodal queries and weak retrieval relevance in knowledge-base-augmented visual question answering (KB-VQA), this paper proposes Wiki-PRF, a three-stage framework. First, it dynamically invokes visual tools to generate high-fidelity multimodal queries. Second, it fuses visual and textual features for precise knowledge retrieval. Third, it applies a learnable relevance filtering mechanism to improve the reliability of retrieved results. The underlying vision-language model is trained end to end with reinforcement learning, using answer accuracy and output-format consistency as reward signals. On E-VQA and InfoSeek, Wiki-PRF improves answer quality by 36.0 and 42.8 points, respectively, setting a new state of the art. Key innovations include dynamic multimodal query generation, cross-modal retrieval feature fusion, and the learnable relevance filtering module.
📝 Abstract
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances on this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval, and Filtering stages. The Processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The Retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The Filtering stage performs relevance filtering and condensation of the retrieval results. To this end, we introduce a visual language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, its tool invocation for accurate queries, and its filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8 points) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
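The three-stage pipeline and the RL reward described above can be sketched in miniature. This is a hypothetical illustration only: all function names, the keyword-overlap retrieval, and the 0.1 format bonus are assumptions, not the authors' implementation (which uses a VLM, visual tools, and learned multimodal features).

```python
# Illustrative sketch of the Processing -> Retrieval -> Filtering flow.
# Names, scoring, and reward weights are assumptions for exposition.

def process(image, question):
    """Processing: invoke visual tools to build a multimodal query.
    Here we just pass the question text through; the real system would
    add tool outputs such as captions or detected entities."""
    return {"text_query": question, "visual_query": image}

def retrieve(query, knowledge_base):
    """Retrieval: stand-in for fused visual+text feature matching,
    implemented here as naive keyword overlap."""
    tokens = query["text_query"].lower().split()
    return [doc for doc in knowledge_base if any(t in doc for t in tokens)]

def filter_relevant(candidates, question, keep=2):
    """Filtering: keep only the passages most relevant to the question."""
    tokens = question.lower().split()
    scored = sorted(candidates,
                    key=lambda d: sum(t in d for t in tokens),
                    reverse=True)
    return scored[:keep]

def reward(prediction, gold, well_formatted):
    """RL reward: answer accuracy plus a format-consistency bonus
    (the 0.1 weight is an assumed value)."""
    acc = float(prediction.strip().lower() == gold.strip().lower())
    return acc + 0.1 * float(well_formatted)

kb = ["the eiffel tower is in paris", "mount fuji is in japan"]
q = "Where is the Eiffel Tower?"
docs = filter_relevant(retrieve(process(None, q), kb), q)
print(docs[0])                          # most relevant passage
print(reward("Paris", "paris", True))   # accuracy 1.0 + format bonus 0.1
```

In the actual framework these stages are not hand-coded heuristics: the VLM itself decides which tools to call and which retrieved passages to keep, and the combined reward drives that behavior during RL training.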