🤖 AI Summary
Existing multimodal LLM (MLLM) benchmarks inadequately assess long-chain reasoning: they suffer from insufficient question difficulty and diversity, vulnerability to guessing and memorization shortcuts, and a lack of fine-grained evaluation of intermediate reasoning steps. To address this, we propose MMReason, an open-ended, multi-step reasoning benchmark designed specifically for MLLMs, spanning six academic disciplines and multiple difficulty levels. Methodologically, it employs open-ended question design coupled with a multi-model voting filter to eliminate questions solvable by guessing or memorization; provides human-annotated stepwise reasoning chains; and introduces a reference-based ternary scoring system enabling automated, interpretable assessment of intermediate steps. Comprehensive evaluation of state-of-the-art MLLMs reveals systematic cross-disciplinary reasoning bottlenecks. We release the benchmark, including the dataset, evaluation toolkit, and a reproducible evaluation protocol, to advance AGI-oriented multimodal reasoning research.
📝 Abstract
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short of precisely and comprehensively evaluating long-chain reasoning abilities in three key respects: (1) lack of difficulty and diversity, (2) susceptibility to guessing and memorization, and (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capabilities with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases solvable by guessing or memorization, ensuring robust reasoning evaluation. Third, we annotate the questions with detailed step-by-step solutions and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
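To make the two evaluation mechanisms concrete, the following is a minimal sketch of how a multi-model voting filter and a reference-based ternary scoring scheme could work. This is an illustrative assumption, not the authors' released implementation: all function names, thresholds, and the "text-only probing" interpretation of the voting filter are hypothetical.

```python
# Hypothetical sketch of the two mechanisms described in the abstract.
# Names, thresholds, and data shapes are illustrative assumptions only.

def vote_filter(answers_without_image, reference_answer, max_correct=1):
    """Multi-model voting filter (sketch): probe several MLLMs with the
    text-only question; if more than `max_correct` of them recover the
    reference answer without the image, the question is presumed
    guessable or memorized and is discarded (returns False)."""
    correct = sum(
        a.strip().lower() == reference_answer.strip().lower()
        for a in answers_without_image
    )
    return correct <= max_correct  # True -> keep the question

def ternary_step_score(judgements):
    """Reference-based ternary scoring (sketch): each intermediate step
    is judged against the annotated solution as 'correct' (1.0),
    'partial' (0.5), or 'wrong' (0.0); the chain score is the mean."""
    value = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}
    return sum(value[j] for j in judgements) / len(judgements)

# Example usage
kept = vote_filter(["4", "7", "unknown"], "4")                # one lucky hit -> keep
score = ternary_step_score(["correct", "partial", "wrong"])   # mean of 1.0, 0.5, 0.0
```

The ternary granularity (rather than binary right/wrong) is what allows partially valid intermediate steps to be credited, which is the interpretability benefit the abstract highlights.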