AI Summary
To address hallucination issues in Vision Large Language Models (VLLMs) for Visual Question Answering (VQA), this paper proposes a three-stage Retrieval-Augmented Generation (RAG) framework. The method integrates multi-source vision–text retrieval, retrieval result re-ranking, and multi-task fine-tuning, augmented by a vision-context-aware data augmentation strategy. It natively supports multi-turn interaction and heterogeneous multimodal information fusion, thereby enhancing visual semantic understanding and alignment with external knowledge. Evaluated on the CRAG-MM benchmark across three tasks, the approach achieves automatic evaluation rankings of 3rd, 3rd, and 1st, respectively; in human evaluation on Task 3, it ranks 2nd. These results demonstrate its effectiveness in mitigating hallucinations and improving factual consistency.
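The three-stage pipeline described above can be sketched in miniature. This is a hedged illustration only: the function names, the keyword-overlap scorers, and the abstention rule are assumptions for demonstration, not the paper's actual retriever, re-ranker, or fine-tuned VLLM.

```python
# Illustrative sketch of a retrieve -> rerank -> generate RAG pipeline.
# All names and scoring functions are stand-ins; a real system would use
# vision-text encoders for retrieval, a cross-encoder for re-ranking, and
# a multi-task fine-tuned VLLM for generation.

def retrieve(query, sources, top_k=4):
    """Stage 1: multi-source retrieval. Each source is a list of text
    passages; candidates are scored by keyword overlap with the query."""
    query_terms = set(query.lower().split())
    candidates = []
    for source in sources:
        for passage in source:
            overlap = len(query_terms & set(passage.lower().split()))
            candidates.append((overlap, passage))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [p for score, p in candidates[:top_k] if score > 0]

def rerank(query, passages, top_k=2):
    """Stage 2: re-rank retrieved passages with a finer-grained score
    (here, overlap normalized by passage length as a crude stand-in)."""
    query_terms = set(query.lower().split())
    def score(p):
        terms = set(p.lower().split())
        return len(query_terms & terms) / max(len(terms), 1)
    return sorted(passages, key=score, reverse=True)[:top_k]

def generate(query, context):
    """Stage 3: answer generation, stubbed. Abstaining on empty context
    mirrors the hallucination-mitigation goal: no evidence, no guess."""
    if not context:
        return "I don't know"
    return f"Answer({query!r}) grounded in {len(context)} passages"

sources = [
    ["the eiffel tower is in paris", "mount fuji is in japan"],
    ["paris is the capital of france"],
]
hits = retrieve("where is the eiffel tower", sources)
best = rerank("where is the eiffel tower", hits)
print(generate("where is the eiffel tower", best))
```

The key design point is the abstention branch in `generate`: when retrieval and re-ranking surface no supporting evidence, answering "I don't know" is preferred over a confident fabrication.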
Abstract
Vision Large Language Models (VLLMs) have improved multi-modal understanding and visual question answering (VQA), but they still suffer from hallucinated answers. Multi-modal Retrieval-Augmented Generation (RAG) helps address these issues by incorporating external information, yet challenges remain in visual context comprehension, multi-source retrieval, and multi-turn interactions. To address these challenges, Meta constructed the CRAG-MM benchmark and launched the CRAG-MM Challenge at KDD Cup 2025, which consists of three tasks. This paper describes the BlackPearl team's solutions to all three tasks of the Meta KDD Cup'25. We use a single model for each task, with key methods including data augmentation, RAG, reranking, and multi-task fine-tuning. Our solution achieves automatic evaluation rankings of 3rd, 3rd, and 1st on the three tasks, and wins second place in Task 3 after human evaluation.