🤖 AI Summary
Current vision-language models struggle to detect when their own reasoning goes wrong and to categorize the specific type of erroneous inference. To address this limitation, this work proposes MMErroR, a multimodal benchmark comprising 2,013 samples spanning six broad domains and 24 subdomains, each embedding a single, coherent reasoning error. By introducing an error-centric, reasoning-oriented evaluation paradigm, the study moves beyond conventional answer-only correctness metrics and establishes a systematic error taxonomy that enables fine-grained model diagnostics. Evaluation across 20 state-of-the-art models reveals that even the best-performing model, Gemini-3.0-Pro, achieves only 66.47% accuracy in error-type identification, underscoring the difficulty of the task.
📝 Abstract
Recent advances in Vision-Language Models (VLMs) have improved performance in multimodal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multimodal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model, Gemini-3.0-Pro, correctly classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Moreover, the ability to accurately identify errors offers valuable insight into the capabilities of multimodal reasoning models. Project Page: https://mmerror-benchmark.github.io
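To make the evaluation protocol concrete, the sketch below shows how a single error-embedded sample might be structured and how error-type identification accuracy could be scored. This is a minimal illustration under assumed conventions: the field names (`image_path`, `reasoning_steps`, `error_type`), the example sample, and the `query_vlm` callable are hypothetical stand-ins, not MMErroR's actual schema or API.

```python
from collections import Counter

# Hypothetical sample layout: each item pairs an image with a reasoning
# chain containing exactly one injected error, labeled with its type.
# Field names and values are illustrative, not MMErroR's actual schema.
sample = {
    "image_path": "samples/geometry_0042.png",
    "question": "What is the area of the shaded region?",
    "reasoning_steps": [
        "Step 1: The square has side length 4, so its area is 16.",
        "Step 2: The inscribed circle has radius 4, so its area is 16*pi.",
        "Step 3: Shaded area = 16 - 16*pi.",  # the error was injected in Step 2
    ],
    "error_step": 2,                            # ground-truth erroneous step
    "error_type": "visual_misinterpretation",   # ground-truth error category
}

def evaluate(samples, query_vlm):
    """Score error-type identification accuracy over a list of samples.

    `query_vlm` is a stand-in for any VLM call that takes one sample's
    image, question, and reasoning chain and returns a predicted
    error-category string.
    """
    correct = 0
    confusion = Counter()  # (gold, predicted) pairs for fine-grained diagnostics
    for s in samples:
        pred = query_vlm(s["image_path"], s["question"], s["reasoning_steps"])
        confusion[(s["error_type"], pred)] += 1
        correct += pred == s["error_type"]
    return correct / len(samples), confusion
```

Keeping a confusion counter alongside the accuracy reflects the benchmark's diagnostic aim: it exposes which error categories a model systematically confuses, rather than reporting a single aggregate score.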