🤖 AI Summary
To address key challenges in multimodal multi-turn question answering (difficulty integrating structured data, high hallucination rates, and weak contextual modeling), this paper proposes a retrieval-augmented generation (RAG) framework that jointly leverages vision-language models (VLMs), image-based knowledge graphs, and web search APIs for cross-modal, multi-source information retrieval. Curriculum learning is incorporated into the reinforcement learning phase to dynamically adjust training difficulty and suppress hallucination, and generation quality is further enhanced via knowledge distillation from GPT-4.1 followed by supervised fine-tuning. Evaluated on multimodal multi-turn QA (Task 3) and knowledge graph QA (Task 1), the method places third and first, respectively, with a 52.38% margin over the second-place system on Task 1, demonstrating its effectiveness in complex query understanding, multi-source information aggregation, and long-range context modeling.
📝 Abstract
This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal, multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution builds on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied a curriculum learning strategy to guide reinforcement learning, improving answer accuracy and reducing hallucination. For Tasks 2 and 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a 52.38% lead over the second-place system, and 3rd place in Task 3, demonstrating the effectiveness of integrating curriculum learning with reinforcement learning in our training pipeline.
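The curriculum-guided training described above can be sketched as a staged release of samples ordered from easy to hard. This is only a minimal illustration: the `difficulty` scoring function and the fixed three-stage schedule are assumptions, since the abstract does not specify the team's actual difficulty criterion or staging.

```python
def curriculum_schedule(samples, difficulty, n_stages=3):
    """Order training samples from easy to hard and release them in stages.

    `samples` is any list of training examples; `difficulty` maps a sample
    to a scalar score (hypothetical stand-in for the paper's criterion).
    Returns one training pool per stage; easier examples remain in later
    pools so the model never stops seeing them.
    """
    ordered = sorted(samples, key=difficulty)
    stage_size = -(-len(ordered) // n_stages)  # ceiling division
    pools = []
    for stage in range(1, n_stages + 1):
        # Each stage trains on all samples up to the current difficulty cap.
        pools.append(ordered[: stage * stage_size])
    return pools

# Toy usage: string length serves as a stand-in difficulty score.
data = ["a", "bbb", "cc", "dddd", "e"]
stages = curriculum_schedule(data, difficulty=len, n_stages=3)
```

In an RL fine-tuning loop, each stage's pool would feed the rollout sampler before advancing to the next, so hard, hallucination-prone queries are only introduced once the policy handles easier ones reliably.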