🤖 AI Summary
Existing multimodal LLM (MLLM) benchmarks inadequately assess long-chain reasoning: they suffer from insufficient question difficulty and diversity, vulnerability to guessing and memorization shortcuts, and a lack of fine-grained evaluation of intermediate reasoning steps. To address this, we propose MMReason, an open-ended, multi-step reasoning benchmark designed specifically for MLLMs, spanning six academic disciplines and multiple difficulty levels. Methodologically, it employs open-ended question design coupled with a multi-model voting filter to eliminate questions solvable by guessing or memorization; provides human-annotated stepwise reasoning chains; and introduces a reference-based ternary scoring system enabling automated, interpretable assessment of intermediate steps. Comprehensive evaluation of state-of-the-art MLLMs reveals systematic cross-disciplinary reasoning bottlenecks. We release the benchmark, including the dataset, evaluation toolkit, and a reproducible evaluation protocol, to advance AGI-oriented multimodal reasoning research.
📝 Abstract
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short of precisely and comprehensively evaluating long-chain reasoning abilities in three key respects: (1) lack of difficulty and diversity, (2) susceptibility to guessing and memorization, and (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capabilities with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases solvable by guessing or memorization, ensuring robust reasoning evaluation. Third, we annotate the questions with detailed step-by-step solutions and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
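To make the two evaluation mechanisms concrete, the following is a minimal sketch of how a multi-model voting filter and a reference-based ternary scoring scheme could work. This is an illustrative assumption, not the authors' released implementation: all function names, thresholds, and the "text-only probing" interpretation of the voting filter are hypothetical.

```python
# Hypothetical sketch of the two mechanisms described in the abstract.
# Names, thresholds, and data shapes are illustrative assumptions only.

def vote_filter(answers_without_image, reference_answer, max_correct=1):
    """Multi-model voting filter (sketch): probe several MLLMs with the
    text-only question; if more than `max_correct` of them recover the
    reference answer without the image, the question is presumed
    guessable or memorized and is discarded (returns False)."""
    correct = sum(
        a.strip().lower() == reference_answer.strip().lower()
        for a in answers_without_image
    )
    return correct <= max_correct  # True -> keep the question

def ternary_step_score(judgements):
    """Reference-based ternary scoring (sketch): each intermediate step
    is judged against the annotated solution as 'correct' (1.0),
    'partial' (0.5), or 'wrong' (0.0); the chain score is the mean."""
    value = {"correct": 1.0, "partial": 0.5, "wrong": 0.0}
    return sum(value[j] for j in judgements) / len(judgements)

# Example usage
kept = vote_filter(["4", "7", "unknown"], "4")                # one lucky hit -> keep
score = ternary_step_score(["correct", "partial", "wrong"])   # mean of 1.0, 0.5, 0.0
```

The ternary granularity (rather than binary right/wrong) is what allows partially valid intermediate steps to be credited, which is the interpretability benefit the abstract highlights.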