🤖 AI Summary
Current vision-language models struggle to detect when their own reasoning goes wrong and to categorize the specific type of erroneous inference. To address this limitation, this work proposes MMErroR, a multimodal benchmark comprising 2,013 samples spanning six broad domains and 24 subdomains, each embedding a single, coherent reasoning error. By introducing an error-centric, reasoning-oriented evaluation paradigm, the study moves beyond conventional answer-only correctness metrics and establishes a systematic error taxonomy that enables fine-grained model diagnostics. Evaluation across 20 state-of-the-art models reveals that even the best-performing model, Gemini-3.0-Pro, achieves only 66.47% accuracy in error-type identification, underscoring the difficulty of the task.
📝 Abstract
Recent advances in Vision-Language Models (VLMs) have improved performance in multimodal learning, raising the question of whether these models truly understand the content they process. Crucially, can VLMs detect when a reasoning process is wrong and identify its error type? To answer this, we present MMErroR, a multimodal benchmark of 2,013 samples, each embedding a single coherent reasoning error. These samples span 24 subdomains across six top-level domains, ensuring broad coverage and taxonomic richness. Unlike existing benchmarks that focus on answer correctness, MMErroR targets process-level, error-centric evaluation that requires models to detect incorrect reasoning and classify the error type within both visual and linguistic contexts. We evaluate 20 advanced VLMs; even the best model, Gemini-3.0-Pro, correctly classifies the error in only 66.47% of cases, underscoring the challenge of identifying erroneous reasoning. Moreover, the ability to accurately identify errors offers valuable insight into the capabilities of multimodal reasoning models. Project Page: https://mmerror-benchmark.github.io
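To make the evaluation protocol concrete, the sketch below shows how a single error-embedded sample might be structured and how error-type identification accuracy could be scored. This is a minimal illustration under assumed conventions: the field names (`image_path`, `reasoning_steps`, `error_type`), the example sample, and the `query_vlm` callable are hypothetical stand-ins, not MMErroR's actual schema or API.

```python
from collections import Counter

# Hypothetical sample layout: each item pairs an image with a reasoning
# chain containing exactly one injected error, labeled with its type.
# Field names and values are illustrative, not MMErroR's actual schema.
sample = {
    "image_path": "samples/geometry_0042.png",
    "question": "What is the area of the shaded region?",
    "reasoning_steps": [
        "Step 1: The square has side length 4, so its area is 16.",
        "Step 2: The inscribed circle has radius 4, so its area is 16*pi.",
        "Step 3: Shaded area = 16 - 16*pi.",  # the error was injected in Step 2
    ],
    "error_step": 2,                            # ground-truth erroneous step
    "error_type": "visual_misinterpretation",   # ground-truth error category
}

def evaluate(samples, query_vlm):
    """Score error-type identification accuracy over a list of samples.

    `query_vlm` is a stand-in for any VLM call that takes one sample's
    image, question, and reasoning chain and returns a predicted
    error-category string.
    """
    correct = 0
    confusion = Counter()  # (gold, predicted) pairs for fine-grained diagnostics
    for s in samples:
        pred = query_vlm(s["image_path"], s["question"], s["reasoning_steps"])
        confusion[(s["error_type"], pred)] += 1
        correct += pred == s["error_type"]
    return correct / len(samples), confusion
```

Keeping a confusion counter alongside the accuracy reflects the benchmark's diagnostic aim: it exposes which error categories a model systematically confuses, rather than reporting a single aggregate score.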