🤖 AI Summary
Existing multimodal large language models (MLLMs) have not been rigorously evaluated on organic cross-modal reasoning, particularly in domains that require deep, inseparable integration of textual and visual information. Method: We introduce EMMA, the first multi-step, multimodal reasoning benchmark focused on mathematics, physics, chemistry, and programming. Unlike text-centric or superficially vision-dependent benchmarks, EMMA explicitly defines and evaluates “non-decomposable modality-cooperative reasoning,” in which text and image inputs are mutually indispensable for correct inference. It comprises human-authored, multi-disciplinary problems, rigorously annotated reasoning chains, and adversarially designed images that stress-test robustness. The evaluation protocol covers chain-of-thought (CoT) prompting and test-time compute scaling. Results: Experiments reveal substantial performance bottlenecks across state-of-the-art MLLMs on EMMA; gains from CoT prompting and increased compute are marginal, highlighting fundamental limitations in current architectures and training paradigms and providing critical empirical grounding for future innovation.
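As a rough illustration of what "CoT prompting with test-time compute scaling" typically involves, the sketch below shows self-consistency-style majority voting over several sampled reasoning chains. The `query_model` callable, the prompt template, and the `Answer:` format are illustrative assumptions, not EMMA's actual evaluation harness.

```python
# Minimal sketch: chain-of-thought prompting plus test-time compute scaling
# via self-consistency (majority vote over sampled reasoning chains).
# Assumptions: `query_model` is a hypothetical stand-in for any MLLM API call;
# the prompt and answer format are illustrative only.
from collections import Counter
from typing import Callable

COT_PROMPT = (
    "Solve the problem using the image and text together. "
    "Think step by step, then give the final answer as 'Answer: <choice>'.\n{question}"
)

def extract_answer(response: str) -> str:
    """Pull the final answer token out of a free-form CoT response."""
    marker = "Answer:"
    return response.rsplit(marker, 1)[-1].strip() if marker in response else response.strip()

def self_consistency(
    query_model: Callable[[str, bytes], str],  # (prompt text, image bytes) -> response text
    question: str,
    image: bytes,
    n_samples: int = 8,  # more samples = more test-time compute
) -> str:
    """Sample several CoT responses and return the majority-vote answer."""
    answers = [
        extract_answer(query_model(COT_PROMPT.format(question=question), image))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    # Toy stand-in model: always produces the same fake reasoning and answer.
    fake_model = lambda prompt, image: "The diagram shows ... Answer: B"
    print(self_consistency(fake_model, "Which force diagram is correct?", b""))  # -> "B"
```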
📝 Abstract
The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning. We introduce EMMA (Enhanced MultiModal reAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.