🤖 AI Summary
Existing visual reasoning benchmarks evaluate only answer correctness, neglecting the logical consistency of the reasoning process. Method: PRISM-Bench introduces the first fine-grained reasoning-error-localization benchmark for multimodal large language models (MLLMs), built on symbolic, geometric, and analogical visual puzzles. It constructs chain-of-thought (CoT) samples containing exactly one human-injected error and requires models to identify the *first* erroneous step. Its core innovation is decoupling answer generation from reasoning verification, establishing a “reasoning chain diagnosis” paradigm. Contribution/Results: Evaluation across mainstream MLLMs reveals that while these models often generate fluent yet incorrect CoTs, they consistently fail to detect even elementary logical fallacies, exposing critical weaknesses in their reasoning robustness. PRISM-Bench thereby sharpens the assessment of logical consistency and error-detection capability in multimodal reasoning.
📝 Abstract
We introduce **PRISM-Bench**, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
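The diagnostic task above can be sketched as a simple scoring loop: each sample pairs a CoT (with one injected error) with the model's predicted first-error index, and the metric is exact-match accuracy on that index. This is a minimal illustration, not the benchmark's actual harness; the field names (`first_error_step`, `predicted_step`) are hypothetical.

```python
# Sketch of first-error-localization scoring in the style of PRISM-Bench.
# Schema and field names are illustrative assumptions, not the real benchmark format.

def first_error_accuracy(samples):
    """Fraction of samples where the predicted first erroneous CoT step
    matches the ground-truth index of the injected error."""
    if not samples:
        return 0.0
    correct = sum(
        1 for s in samples
        if s["predicted_step"] == s["first_error_step"]
    )
    return correct / len(samples)

# Two hypothetical samples: the model localizes the error in the first,
# but points past the true fault in the second.
samples = [
    {"first_error_step": 3, "predicted_step": 3},
    {"first_error_step": 2, "predicted_step": 4},
]
print(first_error_accuracy(samples))  # → 0.5
```

Exact-match on the *first* faulty step (rather than any faulty step) is what makes the task diagnostic: a model credited only when it pinpoints where the chain first breaks cannot rely on flagging a generically implausible later step.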