Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Knowledge-based visual question answering (KB-VQA) faces significant challenges due to retrieval noise and the distributional mismatch between pretrained multimodal large language models and structured knowledge bases, which hinders effective reasoning and domain adaptation. To address these issues, this work proposes Wiki-R1, a framework that leverages curriculum reinforcement learning to dynamically construct a sequence of training distributions aligned with the evolving capabilities of the model. The core innovations include a controllable retriever that generates samples of specified difficulty, a reward-estimation-based difficulty propagation mechanism, and a curriculum sampling strategy focused on non-zero advantage samples. Evaluated on the Encyclopedic VQA and InfoSeek benchmarks, Wiki-R1 achieves state-of-the-art performance, improving accuracy to 37.1% and 44.1%, respectively.

📝 Abstract
Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose *Wiki-R1*, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce *controllable curriculum data generation*, which manipulates the retriever to produce samples at desired difficulty levels, and a *curriculum sampling strategy* that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
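The curriculum sampling idea above can be illustrated with a minimal sketch. In group-based RL methods, each prompt's rollout advantages are computed relative to the group's mean reward, so a prompt whose rollouts all receive the same reward (all correct or all wrong) yields all-zero advantages and contributes no gradient. The function names, the epsilon threshold, and the toy reward values below are illustrative assumptions, not the paper's actual implementation:

```python
# Hedged sketch: filtering for non-zero-advantage prompts in a
# group-relative RL update. All names and values are illustrative.
from statistics import mean


def group_advantages(rewards):
    """Advantage of each rollout relative to its group's mean reward."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def has_nonzero_advantage(rewards, eps=1e-8):
    """A prompt is informative only if its rollouts disagree:
    identical rewards give all-zero advantages, hence no gradient."""
    return any(abs(a) > eps for a in group_advantages(rewards))


# Toy batch: per-prompt rewards for 4 sampled rollouts each.
batch = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # all correct -> zero advantages
    "q2": [0.0, 0.0, 0.0, 0.0],  # all wrong   -> zero advantages
    "q3": [1.0, 0.0, 1.0, 0.0],  # mixed       -> informative
}
selected = [q for q, rs in batch.items() if has_nonzero_advantage(rs)]
print(selected)  # ['q3']
```

Only the mixed-outcome prompt survives the filter; in Wiki-R1's framing, difficulty estimates propagated from observed rewards would steer sampling toward such prompts before rollouts are even drawn.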
Problem

Research questions and friction points this paper is trying to address.

Knowledge-Based Visual Question Answering
Multimodal Reasoning
Distributional Gap
External Knowledge Integration
Domain Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

curriculum reinforcement learning
controllable data generation
multimodal reasoning
knowledge-based VQA
curriculum sampling