CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
A comprehensive multi-modal multi-turn retrieval-augmented generation (MM-RAG) benchmark tailored to wearable-device scenarios, particularly first-person vision, is currently lacking. Method: We introduce CRAG-MM, the first holistic MM-RAG benchmark designed specifically for first-person wearable devices. It comprises 6.5K image–text question-answer triplets and 2K multi-turn visual dialogues spanning 13 real-world domains, including 6.2K egocentric images designed to mimic wearable-device captures. CRAG-MM incorporates multiple challenge dimensions, including image quality, entity popularity, and information dynamism, and provides a dual-source retrieval corpus integrating image–knowledge graphs and web pages to support single-source augmentation, multi-source augmentation, and multi-turn dialogue. Results: Evaluation reveals that straightforward RAG approaches achieve only 32% and 43% truthfulness on single-turn and multi-turn QA, respectively, with state-of-the-art industry solutions performing comparably (32%/45%). CRAG-MM served as the official task of KDD Cup 2025, attracting about 1K participating teams and 5K submissions; top-performing solutions improved baseline performance by 28%.

📝 Abstract
Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
Problem

Research questions and friction points this paper is trying to address.

Lack of comprehensive multimodal RAG benchmark for wearables
Addresses multi-turn conversational QA with egocentric images
Evaluates RAG performance under realistic image quality challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal multi-turn comprehensive RAG benchmark
Diverse dataset of 6.5K image–question–answer triplets
APIs for image-KG and webpage retrieval tasks
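
The dual-source retrieval setup above (an image-KG lookup plus webpage search, feeding a generation step) can be illustrated with a minimal sketch. All function names, the mock corpus, and the refusal logic below are hypothetical illustrations, not the actual CRAG-MM API:

```python
# Hypothetical sketch of an MM-RAG pipeline in the spirit of CRAG-MM's tasks.
# The retriever names and mock corpus are invented for illustration only.

def image_kg_retrieve(image_id: str) -> list[str]:
    """Mock image-KG lookup: facts keyed by the entity recognized in the image."""
    kg = {
        "img_eiffel": ["Eiffel Tower is 330 m tall", "Located in Paris"],
    }
    return kg.get(image_id, [])

def web_retrieve(query: str) -> list[str]:
    """Mock web search: returns snippets sharing a word with the query."""
    pages = [
        "The Eiffel Tower was completed in 1889.",
        "Paris is the capital of France.",
    ]
    words = query.lower().split()
    return [p for p in pages if any(w in p.lower() for w in words)]

def answer(image_id: str, question: str, multi_source: bool = False) -> str:
    """Answer from retrieved context only. Refusing ("I don't know") when
    retrieval comes back empty is preferable to hallucinating under a
    truthfulness-style metric that penalizes wrong answers."""
    context = image_kg_retrieve(image_id)
    if multi_source:  # multi-source augmentation adds web evidence
        context += web_retrieve(question)
    if not context:
        return "I don't know"
    # Placeholder generation step: a real system would condition an LLM
    # on the image, question, and retrieved context.
    return context[0]

print(answer("img_eiffel", "How tall is this tower?"))  # answered from KG facts
print(answer("img_unknown", "zzz"))                     # no evidence: refuses
```

The single-source task corresponds to `multi_source=False` here; the multi-source task merges both evidence streams before generation, at the cost of having to filter noisier web snippets.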
Authors
Jiaqi Wang, Meta Reality Labs
Xiao Yang, Meta Reality Labs
Kai Sun, Meta Reality Labs
Parth Suresh, Meta Reality Labs
Sanat Sharma, Meta (previously Adobe, University of Texas, Microsoft)
Adam Czyzewski, Meta Reality Labs
Derek Andersen, Meta Reality Labs
Surya Appini, Meta Reality Labs
Arkav Banerjee, Meta Reality Labs
Sajal Choudhary, Meta Reality Labs
Shervin Ghasemlou, Meta Reality Labs
Ziqiang Guan, Meta Reality Labs
Akil Iyer, Meta Reality Labs
Haidar Khan, Meta (Natural Language Processing, Machine Learning)
Lingkun Kong, Meta Reality Labs
Roy Luo, Meta Reality Labs
Tiffany Ma, Meta Reality Labs
Zhen Qiao, Meta Superintelligence Labs
David Tran, Meta Reality Labs
Wenfang Xu, Meta Reality Labs
Skyler Yeatman, Meta Reality Labs
Chen Zhou, Meta Reality Labs
Gunveer Gujral, Meta Reality Labs
Yinglong Xia, Facebook (graph analysis, parallel computing, high performance computing, graphical models)
Shane Moon, Meta Reality Labs