🤖 AI Summary
Existing multimodal large language models (MLLMs) have not been rigorously evaluated on organic cross-modal reasoning, particularly in domains that require deep, inseparable integration of textual and visual information. Method: We introduce EMMA, the first multi-step, multimodal reasoning benchmark focused on mathematics, physics, chemistry, and programming. Unlike text-centric or superficially vision-dependent benchmarks, EMMA explicitly defines and evaluates “non-decomposable modality-cooperative reasoning,” in which text and image inputs are mutually indispensable for correct inference. It comprises human-authored, multi-disciplinary problems, rigorously annotated reasoning chains, and adversarially designed images that stress-test robustness. The evaluation protocol covers chain-of-thought (CoT) prompting and test-time compute scaling. Results: Experiments reveal substantial performance bottlenecks across state-of-the-art MLLMs on EMMA; gains from CoT prompting and increased compute are marginal, highlighting fundamental limitations in current architectures and training paradigms and providing critical empirical grounding for future innovation.
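As a rough illustration of what "CoT prompting with test-time compute scaling" typically involves, the sketch below shows self-consistency-style majority voting over several sampled reasoning chains. The `query_model` callable, the prompt template, and the `Answer:` format are illustrative assumptions, not EMMA's actual evaluation harness.

```python
# Minimal sketch: chain-of-thought prompting plus test-time compute scaling
# via self-consistency (majority vote over sampled reasoning chains).
# Assumptions: `query_model` is a hypothetical stand-in for any MLLM API call;
# the prompt and answer format are illustrative only.
from collections import Counter
from typing import Callable

COT_PROMPT = (
    "Solve the problem using the image and text together. "
    "Think step by step, then give the final answer as 'Answer: <choice>'.\n{question}"
)

def extract_answer(response: str) -> str:
    """Pull the final answer token out of a free-form CoT response."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()

def self_consistency(
    query_model: Callable[[str, bytes], str],  # (prompt text, image bytes) -> response text
    question: str,
    image: bytes,
    n_samples: int = 8,  # more samples = more test-time compute
) -> str:
    """Sample several CoT responses and return the majority-vote answer."""
    answers = [
        extract_answer(query_model(COT_PROMPT.format(question=question), image))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy stand-in model: always produces the same fake reasoning and answer.
    fake_model = lambda prompt, image: "The diagram shows ... Answer: B"
    print(self_consistency(fake_model, "Which force diagram is correct?", b""))  # -> "B"
```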
📝 Abstract
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.