🤖 AI Summary
A comprehensive multimodal, multi-turn retrieval-augmented generation (MM-RAG) benchmark tailored to wearable-device scenarios, particularly first-person (egocentric) vision, is currently lacking.
Method: We introduce CRAG-MM, the first holistic MM-RAG benchmark designed specifically for first-person wearable devices. It comprises 6.5K (image, question, answer) triplets and 2K multi-turn visual conversations spanning 13 real-world domains, including 6.2K egocentric images designed to mimic captures from wearable devices. CRAG-MM incorporates multiple challenge dimensions, including image-quality issues, question types, entity popularity, and information dynamism, and provides a dual-source retrieval corpus with APIs for both image-KG retrieval and webpage retrieval, supporting three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations.
Results: Evaluation reveals that straightforward RAG approaches achieve only 32% and 43% truthfulness on single-turn and multi-turn QA, respectively, and state-of-the-art industry solutions perform similarly (32%/45%). CRAG-MM served as the official task of KDD Cup 2025, attracting about 1K participants and 5K submissions; winning solutions improved baseline performance by 28%.
📝 Abstract
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially for wearable-device scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visually grounded multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
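The "straightforward RAG" baseline the abstract evaluates can be sketched as a simple retrieve-then-generate loop over the two retrieval sources. The callables `kg_search`, `web_search`, and `generate` below are hypothetical stand-ins for the benchmark's image-KG retrieval API, webpage retrieval API, and a vision-language model; the real interfaces may differ:

```python
def answer_question(image, question, kg_search, web_search, generate, k=3):
    """Straightforward MM-RAG baseline: retrieve from both sources, then generate.

    kg_search(image, top_k) -> list[str]   : image-KG retrieval (assumed interface)
    web_search(query, top_k) -> list[str]  : webpage retrieval (assumed interface)
    generate(image, prompt) -> str         : vision-language model (assumed interface)
    """
    kg_hits = kg_search(image, top_k=k)        # ground the entity via the image-KG
    web_hits = web_search(question, top_k=k)   # fetch supporting web passages
    context = "\n".join(kg_hits + web_hits)
    # Ask the model to abstain when the context is insufficient, since
    # truthfulness-style scoring typically penalizes hallucinated answers
    # more heavily than "I don't know".
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context; say 'I don't know' if it is insufficient."
    )
    return generate(image, prompt)
```

That such a pipeline (and comparable industry systems) reaches only ~32-45% truthfulness is what motivates the benchmark: most headroom lies in better retrieval grounding and calibrated abstention rather than in the generation step alone.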