MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models

πŸ“… 2026-01-06
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Current vision-language models lack the capability to detect erroneous inference processes and to identify the specific error types in their own reasoning. To address this limitation, this work proposes MMErroR, a multimodal benchmark comprising 2,013 samples spanning six broad domains and 24 subdomains, each embedding a single, coherent reasoning error. By introducing an error-centric, reasoning-oriented evaluation paradigm, this study moves beyond conventional answer-only correctness metrics and establishes a systematic error taxonomy to enable fine-grained model diagnostics. Evaluation across 20 state-of-the-art models reveals that even the best-performing model, Gemini-3.0-Pro, achieves only 66.47% accuracy in error-type identification, underscoring the significant challenge this task presents.

πŸ“ Abstract
Recent advances in Vision-Language Models (VLMs) have improved performance in multi-modal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multi-modal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model (Gemini-3.0-Pro) classifies the error correctly in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Furthermore, the ability to accurately identify errors offers valuable insights into the capabilities of multi-modal reasoning models. Project Page: https://mmerror-benchmark.github.io
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Erroneous Reasoning
Error Detection
Multi-modal Reasoning
Reasoning Evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

erroneous reasoning
vision-language models
error detection
multimodal benchmark
reasoning evaluation
πŸ‘₯ Authors
Yang Shi, Guangdong University of Technology
Yifeng Xie, Hong Kong Baptist University
Minzhe Guo, Guangdong University of Technology
Liangsi Lu, Guangdong University of Technology
Mingxuan Huang, Sun Yat-sen University
Jingchao Wang, East China Normal University
Zhihong Zhu, Peking University
Boyan Xu, Guangdong University of Technology
Zhiqi Huang, Peking University