🤖 AI Summary
Existing multimodal evaluation benchmarks inadequately assess joint image-text understanding and cross-modal reasoning, and thus fail to reflect how well models can genuinely "see" and "read" at the same time. To address this, we propose MMMU-Pro, a robust, multi-discipline benchmark for multimodal understanding and reasoning. Its core contribution is a threefold strengthening strategy: (1) eliminating questions that text-only models can already answer; (2) expanding the set of candidate options so that correct answers are harder to guess; and (3) introducing, for the first time, a vision-only input setting in which the question text itself is embedded within the image, compelling models to recognize and semantically interpret in-image text much as humans integrate visual and textual information. Experiments show that state-of-the-art models suffer substantial performance drops (16.8%–26.9%) on MMMU-Pro, confirming its heightened difficulty. Chain-of-Thought prompting generally improves performance, whereas OCR-specific prompts yield only marginal gains. MMMU-Pro thus establishes a more realistic, cognitively grounded evaluation standard for multimodal AI.
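To make step (1) concrete, here is a minimal sketch of the text-only filtering idea: a question is dropped when text-only models can already pick the correct option without the image. The `TextOnlyModel` callables and the `max_solvers` threshold are hypothetical stand-ins; the paper's actual filtering pipeline and thresholds may differ.

```python
from typing import Callable

# Hypothetical stand-in: a text-only model takes the question string plus its
# candidate options and returns the option letter it would choose.
TextOnlyModel = Callable[[str, list[str]], str]

def filter_text_solvable(questions: list[dict],
                         text_only_models: list[TextOnlyModel],
                         max_solvers: int = 0) -> list[dict]:
    """Keep only questions that text-only models cannot answer reliably.

    A question is discarded when more than `max_solvers` text-only models
    pick the correct option without seeing the image, since such questions
    do not actually require multimodal understanding.
    """
    kept = []
    for q in questions:
        solvers = sum(
            model(q["question"], q["options"]) == q["answer"]
            for model in text_only_models
        )
        if solvers <= max_solvers:
            kept.append(q)
    return kept
```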
📝 Abstract
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step process based on MMMU: (1) filtering out questions answerable by text-only models, (2) augmenting candidate options, and (3) introducing a vision-only input setting where questions are embedded within images. This setting challenges AI to truly "see" and "read" simultaneously, testing a fundamental human cognitive skill of seamlessly integrating visual and textual information. Results show that model performance is substantially lower on MMMU-Pro than on MMMU, ranging from 16.8% to 26.9% across models. We explore the impact of OCR prompts and Chain of Thought (CoT) reasoning, finding that OCR prompts have minimal effect while CoT generally improves performance. MMMU-Pro provides a more rigorous evaluation tool, closely mimicking real-world scenarios and offering valuable directions for future research in multimodal AI.
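As an illustration of the vision-only setting and the Direct vs. CoT prompting comparison, the following sketch scores a model on images that embed the full question and options. The `query_vlm` wrapper and both prompt strings are assumptions for illustration, paraphrasing the two settings rather than quoting the paper, and the answer parsing is deliberately naive.

```python
# Hypothetical `query_vlm(image_path, prompt)` wrapper around whichever
# multimodal model is under test.
DIRECT_PROMPT = (
    "Answer the multiple-choice question shown in the image. "
    "Reply with only the letter of the correct option."
)
COT_PROMPT = (
    "The image contains a multiple-choice question. Reason step by step "
    "about the visual and textual content, then state the final option letter."
)

def evaluate_vision_only(samples, query_vlm, use_cot=False):
    """Score a model on vision-only samples, where each sample's image
    embeds both the question text and its candidate options."""
    prompt = COT_PROMPT if use_cot else DIRECT_PROMPT
    correct = 0
    for s in samples:
        reply = query_vlm(s["image_path"], prompt)
        # Naive scoring: treat the last whitespace-separated token as the
        # predicted option letter (a real harness would parse more carefully).
        prediction = reply.strip().split()[-1].strip(".()") if reply.strip() else ""
        correct += prediction == s["answer"]
    return correct / len(samples)
```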