Evaluating MLLMs with Multimodal Multi-image Reasoning Benchmark

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language model (MLLM) benchmarks predominantly focus on single-image reasoning or evaluate only final answers, lacking fine-grained assessment of structured visual reasoning across multiple images. Method: We introduce MMRB—the first benchmark dedicated to multi-image structured visual reasoning—comprising 92 spatial, temporal, and semantic subtasks. We propose a novel stepwise reasoning evaluation paradigm, incorporating multi-solution chain-of-thought (CoT) annotations refined via human curation, and construct a dedicated reward modeling subset for multi-image ordering. Additionally, we release an LLM-based, sentence-level automatic evaluation framework. Results: Evaluations across 40 MLLMs—including 9 reasoning-specialized models—reveal that open-source models significantly underperform commercial counterparts, and current multimodal reward models exhibit near-total failure on multi-image ordering tasks.
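To make the sentence-level evaluation idea concrete, here is a minimal Python sketch of what such a stepwise matching loop might look like. This is an illustration under assumptions, not the paper's actual protocol: the names `evaluate_solution` and `judge_match`, and the step-level precision/recall/F1 scoring, are hypothetical stand-ins for the LLM-based framework the summary describes.

```python
# Hypothetical sketch of sentence-level CoT step matching; function names
# (evaluate_solution, judge_match) are illustrative, not from the paper.
from typing import Callable, List


def evaluate_solution(
    predicted_steps: List[str],
    reference_steps: List[str],
    judge_match: Callable[[str, str], bool],
) -> dict:
    """Match each predicted reasoning sentence against annotated reference
    steps and report step-level precision, recall, and F1."""
    matched_refs = set()   # reference steps covered by some prediction
    matched_preds = 0      # predictions judged to match a reference step
    for pred in predicted_steps:
        for i, ref in enumerate(reference_steps):
            if judge_match(pred, ref):  # e.g., an open-source LLM as judge
                matched_refs.add(i)
                matched_preds += 1
                break
    precision = matched_preds / len(predicted_steps) if predicted_steps else 0.0
    recall = len(matched_refs) / len(reference_steps) if reference_steps else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy judge: exact string match stands in for an LLM semantic comparison.
    preds = ["The dog first appears in image 2.", "So the order is 2, 1, 3."]
    refs = ["The dog first appears in image 2.", "Image 1 shows the dog leaving."]
    print(evaluate_solution(preds, refs, judge_match=lambda p, r: p == r))
```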

📝 Abstract
With enhanced capabilities and widespread applications, Multimodal Large Language Models (MLLMs) are increasingly required to process and reason over multiple images simultaneously. However, existing MLLM benchmarks focus either on single-image visual reasoning or on multi-image understanding tasks with only final-answer evaluation, leaving the reasoning capabilities of MLLMs over multi-image inputs largely underexplored. To address this gap, we introduce the Multimodal Multi-image Reasoning Benchmark (MMRB), the first benchmark designed to evaluate structured visual reasoning across multiple images. MMRB comprises 92 sub-tasks covering spatial, temporal, and semantic reasoning, with multi-solution, CoT-style annotations generated by GPT-4o and refined by human experts. A derivative subset is designed to evaluate multimodal reward models in multi-image scenarios. To support fast and scalable evaluation, we propose a sentence-level matching framework using open-source LLMs. Extensive baseline experiments on 40 MLLMs, including 9 reasoning-specific models and 8 reward models, demonstrate that open-source MLLMs still lag significantly behind commercial MLLMs in multi-image reasoning tasks. Furthermore, current multimodal reward models are nearly incapable of handling multi-image reward ranking tasks.
Problem

Research questions and friction points this paper is trying to address.

Benchmarks for evaluating MLLM reasoning over multiple images are lacking
Existing benchmarks cover only single-image reasoning or final-answer evaluation
Multimodal reward models struggle with multi-image ranking tasks (see the sketch below)
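As a rough illustration of the reward-ranking setup on which the paper reports near-total failure, below is a minimal sketch of how such an evaluation could be scored. The `reward_fn` interface and the pairwise ranking-accuracy metric are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hypothetical sketch of scoring a reward model on multi-image ordering;
# the reward_fn interface is an assumed stand-in, not the benchmark's API.
from typing import Callable, List, Tuple


def ranking_accuracy(
    examples: List[Tuple[str, str, List[str]]],
    reward_fn: Callable[[str, str], float],
) -> float:
    """Fraction of examples where the reward model scores the correct
    ordering response strictly above every perturbed alternative."""
    correct = 0
    for prompt, chosen, rejected in examples:
        chosen_score = reward_fn(prompt, chosen)
        if all(chosen_score > reward_fn(prompt, r) for r in rejected):
            correct += 1
    return correct / len(examples) if examples else 0.0


if __name__ == "__main__":
    # Toy reward: response length stands in for a real multimodal reward model.
    toy_reward = lambda prompt, resp: float(len(resp))
    data = [
        ("Order the three frames chronologically.",
         "2 -> 1 -> 3, since the cup is full in frame 2 and empty in frame 3.",
         ["1 -> 2 -> 3", "3 -> 2 -> 1"]),
    ]
    print(ranking_accuracy(data, toy_reward))
```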
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the Multimodal Multi-image Reasoning Benchmark (MMRB) with 92 sub-tasks
Generates multi-solution CoT annotations with GPT-4o, refined by human experts
Proposes a sentence-level matching framework using open-source LLMs for fast, scalable evaluation
👥 Authors
Ziming Cheng
National University of Singapore, BUPT, SenseTime
Multimodal LLM, Web Agent, 3D Human Pose Estimation
Binrui Xu
BUPT, China
Lisheng Gong
BUPT, China
Zuhe Song
BUPT, China
Tianshuo Zhou
BUPT, China
Shiqi Zhong
BUPT, China
Siyu Ren
Shanghai Jiao Tong University
NLP
Mingxiang Chen
BUPT, China
Xiangchao Meng
BUPT, China
Yuxin Zhang
YSU, China
Yanlin Li
Carnegie Mellon University
Computer Security
Lei Ren
Li Auto
NLP, LLM, VLM
Wei Chen
Li Auto Inc., China
Zhiyuan Huang
SenseTime Research, China
Mingjie Zhan
SenseTime Research, China
Xiaojie Wang
BUPT, China
Fangxiang Feng
Beijing University of Posts and Telecommunications
Multimodal Learning, Image Synthesis