🤖 AI Summary
Existing visual reasoning benchmarks evaluate only answer correctness, neglecting the logical consistency of the reasoning process. Method: PRISM-Bench introduces the first fine-grained reasoning-error-localization benchmark for multimodal large language models (MLLMs), built on symbolic, geometric, and analogical visual puzzles. It constructs chain-of-thought (CoT) samples containing exactly one human-injected error and requires models to identify the *first* erroneous step. Its core innovation is decoupling answer generation from reasoning verification, establishing a “reasoning chain diagnosis” paradigm. Contribution/Results: Evaluation across mainstream MLLMs reveals that while these models often generate fluent yet incorrect CoTs, they consistently fail to detect even elementary logical fallacies, exposing critical weaknesses in their reasoning robustness. PRISM-Bench thereby sharpens the assessment of logical consistency and error-detection capability in multimodal reasoning.
📝 Abstract
We introduce **PRISM-Bench**, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
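The diagnostic task above can be sketched as a simple scoring loop: each sample pairs a CoT (with one injected error) with the model's predicted first-error index, and the metric is exact-match accuracy on that index. This is a minimal illustration, not the benchmark's actual harness; the field names (`first_error_step`, `predicted_step`) are hypothetical.

```python
# Sketch of first-error-localization scoring in the style of PRISM-Bench.
# Schema and field names are illustrative assumptions, not the real benchmark format.

def first_error_accuracy(samples):
    """Fraction of samples where the predicted first erroneous CoT step
    matches the ground-truth index of the injected error."""
    if not samples:
        return 0.0
    correct = sum(
        1 for s in samples
        if s["predicted_step"] == s["first_error_step"]
    )
    return correct / len(samples)

# Two hypothetical samples: the model localizes the error in the first,
# but points past the true fault in the second.
samples = [
    {"first_error_step": 3, "predicted_step": 3},
    {"first_error_step": 2, "predicted_step": 4},
]
print(first_error_accuracy(samples))  # → 0.5
```

Exact-match on the *first* faulty step (rather than any faulty step) is what makes the task diagnostic: a model credited only when it pinpoints where the chain first breaks cannot rely on flagging a generically implausible later step.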