Journey Before Destination: On the Importance of Visual Faithfulness in Slow Thinking

📅 2025-12-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing reasoning-augmented vision-language models (VLMs) generate chain-of-thought (CoT) outputs with two failure modes that final-answer accuracy alone cannot detect: intermediate perception steps may diverge from the image evidence while still reaching the correct answer ("hallucinatory reasoning"), or may remain faithful while the final prediction is wrong. Method: We introduce visual faithfulness as a distinct, independent evaluation dimension and propose a training-free, reference-free framework that decomposes a CoT into perception and reasoning steps and judges the faithfulness of each perception step with off-the-shelf VLM judges. We further design a lightweight zero-shot self-reflection mechanism that detects unfaithful perception steps and regenerates them locally. Contribution/Results: Evaluated across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method significantly reduces the Unfaithful Perception Rate while preserving final-answer accuracy, thereby improving the reliability and interpretability of multimodal reasoning.
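
To make the decomposition stage concrete, here is a minimal sketch, assuming a generic chat-style judge callable; `query_vlm`, the prompt wording, and the function names are illustrative stand-ins, since the summary does not specify the paper's actual interfaces:

```python
# Hypothetical sketch: label each CoT step as perception vs. reasoning by
# prompting an off-the-shelf VLM judge. `query_vlm` stands in for whatever
# chat API the judge model exposes; prompt and names are illustrative only.

STEP_TYPE_PROMPT = (
    "You are given one step from a model's chain of thought about an image.\n"
    "Answer PERCEPTION if the step describes visual content of the image,\n"
    "or REASONING if it only draws inferences from earlier steps.\n\n"
    "Step: {step}"
)

def decompose_chain(steps: list[str], query_vlm) -> list[tuple[str, str]]:
    """Return (step_text, label) pairs, label 'perception' or 'reasoning'."""
    labeled = []
    for step in steps:
        verdict = query_vlm(STEP_TYPE_PROMPT.format(step=step))
        label = "perception" if "PERCEPTION" in verdict.upper() else "reasoning"
        labeled.append((step, label))
    return labeled
```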

📝 Abstract
Reasoning-augmented vision-language models (VLMs) generate explicit chains of thought that promise greater capability and transparency but also introduce new failure modes: models may reach correct answers via visually unfaithful intermediate steps, or reason faithfully yet fail on the final prediction. Standard evaluations that only measure final-answer accuracy cannot distinguish these behaviors. We introduce the visual faithfulness of reasoning chains as a distinct evaluation dimension, focusing on whether the perception steps of a reasoning chain are grounded in the image. We propose a training- and reference-free framework that decomposes chains into perception versus reasoning steps and uses off-the-shelf VLM judges for step-level faithfulness, additionally verifying this approach through a human meta-evaluation. Building on this metric, we present a lightweight self-reflection procedure that detects and locally regenerates unfaithful perception steps without any training. Across multiple reasoning-trained VLMs and perception-heavy benchmarks, our method reduces Unfaithful Perception Rate while preserving final-answer accuracy, improving the reliability of multimodal reasoning.
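
A minimal sketch of how step-level judging could yield the Unfaithful Perception Rate described above; the judge prompt, the `judge_vlm` callable, and the pooling of steps across chains are assumptions based on the abstract, not the paper's published implementation:

```python
# Hypothetical sketch of step-level faithfulness judging and the Unfaithful
# Perception Rate (UPR). `judge_vlm` is a stand-in for an off-the-shelf VLM
# judge that sees the image; the prompt and exact UPR pooling are assumed.

FAITHFULNESS_PROMPT = (
    "Look at the image and judge whether the following statement about it\n"
    "is supported by the visual evidence. Answer FAITHFUL or UNFAITHFUL.\n\n"
    "Statement: {step}"
)

def unfaithful_perception_rate(chains, judge_vlm) -> float:
    """Fraction of perception steps judged unfaithful, pooled over chains.
    Each element of `chains` is (image, labeled_steps), where labeled_steps
    holds (step_text, label) pairs as produced by the decomposition stage."""
    total, unfaithful = 0, 0
    for image, labeled_steps in chains:
        for step, label in labeled_steps:
            if label != "perception":
                continue  # reasoning steps are not checked against the image
            total += 1
            verdict = judge_vlm(image, FAITHFULNESS_PROMPT.format(step=step))
            if "UNFAITHFUL" in verdict.upper():  # check before "FAITHFUL"
                unfaithful += 1
    return unfaithful / total if total else 0.0
```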
Problem

Research questions and friction points this paper is trying to address.

Evaluates visual faithfulness in reasoning chains of VLMs
Distinguishes perception from reasoning steps without training
Improves reliability by detecting and regenerating unfaithful steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes training-free framework for evaluating visual reasoning faithfulness
Decomposes reasoning chains into perception versus reasoning steps
Uses lightweight self-reflection to regenerate unfaithful perception steps (sketched below)
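
A minimal sketch of such a self-reflection loop, assuming a boolean `judge` callable and a `regenerate` callable that re-drafts a single step from the image and the preceding context; the retry budget and both interfaces are hypothetical:

```python
# Hypothetical sketch of the zero-shot self-reflection loop: unfaithful
# perception steps are locally regenerated (conditioned on the image and the
# steps before them) while the rest of the chain is kept verbatim. `judge`
# and `regenerate` stand in for VLM calls; their signatures are assumptions.

MAX_RETRIES = 3  # assumed budget; the paper's actual budget may differ

def reflect_and_repair(image, labeled_steps, judge, regenerate):
    """Return a repaired chain in which each perception step either passes
    the faithfulness judge or is the best attempt after MAX_RETRIES."""
    repaired = []
    for step, label in labeled_steps:
        if label == "perception":
            for _ in range(MAX_RETRIES):
                if judge(image, step):  # True => step grounded in the image
                    break
                # Regenerate only this step, keeping earlier steps as context.
                step = regenerate(image, context=[s for s, _ in repaired])
        repaired.append((step, label))
    return repaired
```

Keeping the regeneration local, rather than re-sampling the whole chain, is what preserves the original reasoning steps and, per the results above, the final-answer accuracy.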