MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal LLM (MLLM) benchmarks inadequately assess long-chain reasoning: they suffer from insufficient question difficulty and diversity, vulnerability to guessing or memorization biases, and a lack of fine-grained evaluation of intermediate reasoning steps. To address this, the authors propose MMReason, an open-ended, multi-step reasoning benchmark specifically designed for MLLMs, spanning six academic disciplines and multiple difficulty levels. Methodologically, it employs open-ended question design coupled with a multi-model voting filter to mitigate guessing and memorization shortcuts; constructs human-annotated stepwise reasoning chains; and introduces a reference-based ternary scoring system enabling automated, interpretable assessment of intermediate steps. Comprehensive evaluation of state-of-the-art MLLMs reveals systematic cross-disciplinary reasoning bottlenecks. The benchmark, including the dataset, evaluation toolkit, and reproducible evaluation protocol, is released to advance AGI-oriented multimodal reasoning research.

📝 Abstract
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
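The multi-model voting filter described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the vote threshold, answer normalization, and the idea of probing models with the question text alone are all assumptions about how such a shortcut filter might work.

```python
# Sketch of a multi-model voting filter for removing "shortcut" questions,
# i.e. questions most models answer correctly by guessing or memorization
# (for example, from the text alone, without the image).
# The threshold and matching logic are hypothetical, not from the paper.
from collections import Counter


def majority_answer(answers):
    """Return the most common answer across models and its vote count."""
    (top, votes), = Counter(answers).most_common(1)
    return top, votes


def keep_question(model_answers, reference, vote_threshold=2):
    """Keep a question only if a majority of models does NOT already
    recover the reference answer -- otherwise it is a shortcut case."""
    top, votes = majority_answer(model_answers)
    shortcut = (top == reference) and (votes >= vote_threshold)
    return not shortcut


# Two of three models guess "42" without reasoning: question is filtered out.
print(keep_question(["42", "42", "7"], reference="42"))  # False
# No majority recovers the answer: question is kept.
print(keep_question(["3", "42", "7"], reference="42"))   # True
```

In this sketch, filtering is purely answer-based; the paper's open-ended reformulation step would precede it, so that agreement cannot come from option elimination.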
Problem

Research questions and friction points this paper is trying to address.

Existing benchmarks lack the difficulty and diversity needed to test long-chain MLLM reasoning
Multiple-choice formats are susceptible to guessing and memorization shortcuts
Intermediate reasoning steps go unassessed without annotated stepwise solutions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diverse multi-step questions from 6 disciplines
Open-ended format with multi-model voting filtering
Reference-based ternary scoring for reasoning steps
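The reference-based ternary scoring idea, grading each intermediate step as correct, partially correct, or incorrect against an annotated reference chain, can be illustrated with a minimal sketch. The token-overlap matching below is a stand-in for the paper's actual (likely model-based) judge; the score values and alignment by position are assumptions.

```python
# Minimal sketch of reference-based ternary scoring of reasoning steps.
# Each predicted step is compared to the human-annotated reference step and
# graded into one of three levels. String matching is a placeholder for the
# benchmark's real judging procedure.
CORRECT, PARTIAL, INCORRECT = 1.0, 0.5, 0.0


def score_step(predicted, reference):
    """Grade one reasoning step: exact match -> correct; shared key token
    -> partially correct; otherwise incorrect."""
    pred = predicted.strip().lower()
    ref = reference.strip().lower()
    if pred == ref:
        return CORRECT
    if any(token in pred for token in ref.split()):
        return PARTIAL
    return INCORRECT


def score_chain(predicted_steps, reference_steps):
    """Average the ternary step scores over a position-aligned chain."""
    scores = [score_step(p, r) for p, r in zip(predicted_steps, reference_steps)]
    return sum(scores) / len(scores)


print(score_chain(["x = 3", "so area is 9"], ["x = 3", "area = x**2 = 9"]))
```

Scoring each step rather than only the final answer is what makes the evaluation interpretable: a chain that guesses the right answer through wrong intermediate steps is penalized.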
Authors
Huanjin Yao, Tsinghua University
Jiaxing Huang, Nanyang Technological University
Yawen Qiu, Tsinghua University
Michael K. Chen, Nanyang Technological University
Wenzheng Liu, University of California
Wei Zhang, University of Science and Technology of China
Wenjie Zeng, Tsinghua University
Xikun Zhang, Nanyang Technological University
Jingyi Zhang, Nanyang Technological University
Yuxin Song, Baidu
Wenhao Wu, Baidu Inc.
Dacheng Tao, Nanyang Technological University