Knowledge-based Visual Question Answering with Multimodal Processing, Retrieval and Filtering

📅 2025-10-16
🤖 AI Summary
To address low-quality multimodal queries and weak retrieval relevance in knowledge-base-augmented visual question answering (KB-VQA), this paper proposes Wiki-PRF, a three-stage framework. First, it dynamically invokes visual tools to generate high-fidelity multimodal queries. Second, it fuses visual and textual features for precise knowledge retrieval. Third, it applies a learnable relevance filtering mechanism to improve the reliability of retrieved results. The vision-language model at the core of the framework is trained with reinforcement learning, using answer accuracy and output format consistency as reward signals, which jointly strengthens its reasoning, tool invocation, and filtering of irrelevant content. On E-VQA and InfoSeek, Wiki-PRF achieves absolute improvements of 36.0 and 42.8 points, respectively, setting new state-of-the-art performance. Key innovations include dynamic multimodal query generation, cross-modal retrieval feature fusion, and a learnable relevance filtering module.
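The three stages described above can be sketched as a toy pipeline. Every function name, the overlap-based scoring, and the list-of-dicts knowledge base below are illustrative assumptions for exposition, not the authors' actual implementation:

```python
# Hypothetical sketch of the Wiki-PRF Processing / Retrieval / Filtering
# stages. Visual features are stand-in vectors; "tools" are callables that
# turn features into text (e.g. a captioner or OCR model in the real system).

def process(image_feats, question, tools):
    """Stage 1 (Processing): invoke visual tools to build a multimodal query."""
    tool_text = " ".join(tool(image_feats) for tool in tools)
    return {"text": question + " " + tool_text, "visual": image_feats}

def retrieve(query, knowledge_base, top_k=3):
    """Stage 2 (Retrieval): rank entries by fused text overlap + visual similarity."""
    def score(entry):
        text_overlap = len(set(query["text"].split()) & set(entry["text"].split()))
        visual_sim = sum(a * b for a, b in zip(query["visual"], entry["visual"]))
        return text_overlap + visual_sim
    return sorted(knowledge_base, key=score, reverse=True)[:top_k]

def filter_relevant(candidates, question, threshold=1):
    """Stage 3 (Filtering): drop passages sharing too few words with the question
    (the paper learns this filter; a word-overlap threshold stands in here)."""
    q_words = set(question.lower().split())
    return [c for c in candidates
            if len(q_words & set(c["text"].lower().split())) >= threshold]
```

A minimal run: build a two-entry knowledge base, let a stub captioning tool enrich the query, then retrieve and filter.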

📝 Abstract
Knowledge-based visual question answering (KB-VQA) requires visual language models (VLMs) to integrate visual understanding with external knowledge retrieval. Although retrieval-augmented generation (RAG) achieves significant advances in this task by incorporating knowledge-base querying, it still struggles with the quality of multimodal queries and the relevance of retrieved results. To overcome these challenges, we propose a novel three-stage method, termed Wiki-PRF, comprising Processing, Retrieval and Filtering stages. The processing stage dynamically invokes visual tools to extract precise multimodal information for retrieval. The retrieval stage integrates visual and text features to achieve multimodal knowledge retrieval. The filtering stage performs relevance filtering and concentration on retrieval results. To this end, we introduce a visual language model trained via reinforcement learning, with answer accuracy and format consistency as reward signals. This enhances the model's reasoning, tool invocation for accurate queries, and filtering of irrelevant content. Experiments on benchmark datasets (E-VQA and InfoSeek) show significant improvements (36.0 and 42.8) in answer quality, achieving state-of-the-art performance. Code is available at https://github.com/cqu-student/Wiki-PRF
Problem

Research questions and friction points this paper is trying to address.

Enhancing multimodal query quality for knowledge retrieval
Improving relevance filtering of retrieved external knowledge
Boosting answer accuracy through reinforcement learning optimization
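The reinforcement learning signal mentioned above combines answer accuracy with output format consistency. A toy scalar reward could look like the following; the `<answer>...</answer>` tag convention and the equal weights are assumptions, not the paper's exact reward design:

```python
import re

def reward(response: str, gold_answer: str) -> float:
    """Toy reward mixing two signals: does the response follow the expected
    output format, and does the extracted answer match the gold answer?
    Weights (0.5 / 0.5) and the tag format are illustrative assumptions."""
    # Format consistency: response must wrap its answer in <answer> tags.
    fmt_ok = 1.0 if re.search(r"<answer>.*</answer>", response, re.S) else 0.0
    # Answer accuracy: exact match (case-insensitive) on the tagged span.
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    pred = m.group(1).strip().lower() if m else ""
    acc = 1.0 if pred == gold_answer.strip().lower() else 0.0
    return 0.5 * fmt_ok + 0.5 * acc
```

A well-formatted correct answer scores 1.0, a well-formatted wrong answer 0.5, and an untagged response 0.0, so the policy is pushed toward both correctness and the expected output structure.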
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal processing dynamically invokes visual tools
Retrieval integrates visual and text features
Filtering performs relevance filtering and concentration
Yuyang Hong
School of Artificial Intelligence, University of Chinese Academy of Sciences
Jiaqi Gu
Alibaba Cloud Computing
Qi Yang
School of Artificial Intelligence, University of Chinese Academy of Sciences
Lubin Fan
Alibaba Cloud
Yue Wu
Alibaba Cloud Computing
Ying Wang
MAIS, Institute of Automation, Chinese Academy of Sciences
Kun Ding
CASIA
Shiming Xiang
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Jieping Ye
Alibaba Cloud Computing