A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

📅 2025-08-14
🤖 AI Summary
Multimodal multi-turn question answering is hampered by difficult structured-data integration, high hallucination rates, and weak contextual modeling. This paper proposes a retrieval-augmented generation (RAG) framework that combines a vision-language model (VLM), an image-based knowledge graph, and web search APIs for cross-modal, multi-source information retrieval. It incorporates curriculum learning into the reinforcement learning phase to adjust training difficulty dynamically and suppress hallucination. Generation quality is further improved through supervised fine-tuning on knowledge distilled from GPT-4.1. The method placed first on Task 1 (knowledge graph QA), with a reported lead of 52.38%, and third on Task 3 (multimodal multi-turn QA), which supports its effectiveness in complex query understanding, multi-source information aggregation, and long-range context modeling.

📝 Abstract
This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal, multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, resulting in improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of integrating curriculum learning with reinforcement learning in our training pipeline.
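The division of retrieval sources across the three tasks can be illustrated with a minimal sketch. It is not the team's implementation: `build_prompt`, `Turn`, the prompt layout, and the two retriever callables are hypothetical names introduced here, and the image input that the real system passes to the vision-language model is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical retriever type: takes a text query, returns text snippets.
Retriever = Callable[[str], List[str]]


@dataclass
class Turn:
    """One earlier exchange in a multi-turn conversation."""
    question: str
    answer: str


def build_prompt(
    task: int,
    question: str,
    kg_search: Retriever,
    web_search: Retriever,
    history: Sequence[Turn] = (),
    top_k: int = 3,
) -> str:
    """Assemble the text context for one question.

    Task 1 uses only the knowledge graph, Tasks 2 and 3 add web search,
    and Task 3 also prepends the conversation history.
    """
    if task not in (1, 2, 3):
        raise ValueError("task must be 1, 2, or 3")

    snippets = kg_search(question)[:top_k]
    if task >= 2:
        snippets += web_search(question)[:top_k]

    lines: List[str] = []
    if task == 3:
        for turn in history:
            lines.append(f"User: {turn.question}")
            lines.append(f"Assistant: {turn.answer}")
    lines.append("Evidence:")
    lines.extend(f"- {s}" for s in snippets)
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

In use, `kg_search` and `web_search` would wrap whatever retrieval interfaces the challenge provides; here they are plain callables so the routing logic can be exercised without any external service.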
Problem

Research questions and friction points this paper is trying to address.

Building a multimodal question answering system with RAG
Enhancing answer accuracy while reducing hallucination
Handling multi-turn conversations with context understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum learning guides reinforcement learning training
Vision large language model enhanced by supervised fine-tuning
Web search APIs incorporated for external knowledge integration
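One common way to let a curriculum guide reinforcement learning is to rank training questions by difficulty and widen the pool of eligible questions as training progresses. The sketch below shows that idea only; the summary does not say how the authors score difficulty or schedule it, so the linear schedule, the `start_frac` parameter, and the function names are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

# Each sample is (sample_id, difficulty). How difficulty is estimated is an
# assumption here; one option is 1 minus the current policy's pass rate.
Sample = Tuple[str, float]


def curriculum_pool(
    samples: Sequence[Sample],
    step: int,
    total_steps: int,
    start_frac: float = 0.3,
) -> List[str]:
    """Return the IDs eligible at `step`, ordered easiest first.

    The eligible fraction grows linearly from `start_frac` to 1.0.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    ordered = sorted(samples, key=lambda s: s[1])
    progress = min(max(step / total_steps, 0.0), 1.0)
    frac = start_frac + (1.0 - start_frac) * progress
    n = max(1, round(frac * len(ordered)))
    return [sample_id for sample_id, _ in ordered[:n]]


def sample_batch(
    samples: Sequence[Sample],
    step: int,
    total_steps: int,
    batch_size: int,
    rng: random.Random,
) -> List[str]:
    """Draw a batch (with replacement) from the currently eligible pool."""
    pool = curriculum_pool(samples, step, total_steps)
    return [rng.choice(pool) for _ in range(batch_size)]
```

Early batches drawn this way contain only the easiest questions, and the hardest ones enter the pool near the end of training. A dynamic variant would recompute the difficulty scores periodically from the current policy's results rather than fixing them up front.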
Chenliang Zhang
Meituan, Shanghai, China
Lin Wang
Meituan, Shanghai, China
Yuanyuan Lu
Meituan, Shanghai, China
Yusheng Qi
Meituan, Shanghai, China
Kexin Wang
PhD student of Biomedical Engineering, Johns Hopkins University
MRI, CEST
Peixu Hou
Meituan, Shanghai, China
Wenshi Chen
Meituan, Shanghai, China