🤖 AI Summary
A comprehensive multimodal, multi-turn retrieval-augmented generation (MM-RAG) benchmark tailored to wearable-device scenarios, particularly first-person (egocentric) vision, is currently lacking.
Method: We introduce CRAG-MM, the first holistic MM-RAG benchmark designed specifically for first-person wearable devices. It comprises 6.5K (image, question, answer) triplets and 2K multi-turn visual conversations spanning 13 real-world domains, including 6.2K egocentric images designed to mimic captures from wearable devices. CRAG-MM incorporates multiple challenge dimensions, including image-quality issues, question types, entity popularity, and information dynamism, and provides a dual-source retrieval corpus with APIs for both image-KG retrieval and webpage retrieval, supporting three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations.
Results: Evaluation reveals that straightforward RAG approaches achieve only 32% and 43% truthfulness on single-turn and multi-turn QA, respectively, and state-of-the-art industry solutions perform similarly (32%/45%). CRAG-MM served as the official task of KDD Cup 2025, attracting about 1K participants and 5K submissions; winning solutions improved baseline performance by 28%.
📝 Abstract
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially for wearable-device scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visually grounded multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
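The "straightforward RAG" baseline the abstract evaluates can be sketched as a simple retrieve-then-generate loop over the two retrieval sources. The callables `kg_search`, `web_search`, and `generate` below are hypothetical stand-ins for the benchmark's image-KG retrieval API, webpage retrieval API, and a vision-language model; the real interfaces may differ:

```python
def answer_question(image, question, kg_search, web_search, generate, k=3):
    """Straightforward MM-RAG baseline: retrieve from both sources, then generate.

    kg_search(image, top_k) -> list[str]   : image-KG retrieval (assumed interface)
    web_search(query, top_k) -> list[str]  : webpage retrieval (assumed interface)
    generate(image, prompt) -> str         : vision-language model (assumed interface)
    """
    kg_hits = kg_search(image, top_k=k)        # ground the entity via the image-KG
    web_hits = web_search(question, top_k=k)   # fetch supporting web passages
    context = "\n".join(kg_hits + web_hits)
    # Ask the model to abstain when the context is insufficient, since
    # truthfulness-style scoring typically penalizes hallucinated answers
    # more heavily than "I don't know".
    prompt = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context; say 'I don't know' if it is insufficient."
    )
    return generate(image, prompt)
```

That such a pipeline (and comparable industry systems) reaches only ~32-45% truthfulness is what motivates the benchmark: most headroom lies in better retrieval grounding and calibrated abstention rather than in the generation step alone.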