SPD-Faith Bench: Diagnosing and Improving Faithfulness in Chain-of-Thought for Multimodal Large Language Models

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reasoning-level unfaithfulness in multimodal large language models during chain-of-thought reasoning, which often stems from a disconnect between perception and reasoning. To this end, the authors introduce SPD-Faith Bench, the first diagnostic benchmark designed specifically to evaluate reasoning-level faithfulness, using fine-grained image-difference tasks that enforce explicit visual comparison. The benchmark reveals two systematic failure modes, perceptual blindness and perception-reasoning dissociation, which the authors trace to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, the paper proposes SAGE, a training-free framework that calibrates visual evidence to improve consistency between perception and reasoning. Experiments on mainstream MLLMs underscore that faithfulness must be evaluated beyond mere answer correctness.

📝 Abstract
Chain-of-Thought reasoning is widely used to improve the interpretability of multimodal large language models (MLLMs), yet the faithfulness of the generated reasoning traces remains unclear. Prior work has mainly focused on perceptual hallucinations, leaving reasoning-level unfaithfulness underexplored. To isolate faithfulness from linguistic priors, we introduce SPD-Faith Bench, a diagnostic benchmark based on fine-grained image-difference reasoning that enforces explicit visual comparison. Evaluations on state-of-the-art MLLMs reveal two systematic failure modes: perceptual blindness and perception-reasoning dissociation. We trace these failures to decaying visual attention and representation shifts in the residual stream. Guided by this analysis, we propose SAGE, a training-free, visual-evidence-calibrated framework that improves visual routing and aligns reasoning with perception. Our results highlight the importance of explicitly evaluating faithfulness beyond response correctness. Our benchmark and code are available at https://github.com/Johanson-colab/SPD-Faith-Bench.
Problem

Research questions and friction points this paper is trying to address.

faithfulness
chain-of-thought
multimodal large language models
reasoning unfaithfulness
visual perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

faithfulness
multimodal large language models
chain-of-thought
visual reasoning
benchmark