On the Faithfulness of Visual Thinking: Measurement and Enhancement

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the visual unfaithfulness problem in multimodal chain-of-thought (MCoT) reasoning generated by large vision-language models (LVLMs) under reinforcement fine-tuning (RFT): although MCoT traces appear image-grounded, they often ignore or misinterpret visual inputs, yielding correct answers from incorrect evidence. Through causal intervention analysis, the authors first systematically expose the unreliability and insufficiency of the visual evidence in MCoT. To address this, they propose Sufficient-Component Cause Model (SCCM) learning, a self-supervised strategy that requires no human annotations and encourages the model to generate vision-relevant reasoning components that are minimal yet sufficient to independently support the final answer. The method is plug-and-play with diverse RFT pipelines and significantly improves visual faithfulness on fine-grained perception and reasoning benchmarks while preserving, and in some cases exceeding, original task performance.

📝 Abstract
Recent large vision-language models (LVLMs) can generate vision-text multimodal chain-of-thought (MCoT) traces after reinforcement fine-tuning (RFT). However, we observe that the visual information incorporated in MCoT is often inaccurate, yet the model still yields correct answers, indicating a lack of faithfulness in the MCoT reasoning process. We attribute this unfaithfulness to the RL reward in RFT, which solely incentivizes the format of interleaved vision-text cues, i.e., it encourages the model to incorporate visual information into its text reasoning steps without considering the correctness of that visual information. In this paper, we first probe the faithfulness of MCoT by measuring how much the prediction changes when its visual and textual thoughts are intervened on. Surprisingly, the model's predictions remain nearly unchanged under visual intervention but change significantly under textual intervention, indicating that the visual evidence is largely ignored. To further analyze the visual information, we introduce an automated LVLM-based evaluation metric that quantifies the faithfulness of visual cues from two perspectives: reliability and sufficiency. Our evaluation reveals that the visual information in current MCoT traces is simultaneously unreliable and insufficient. To address this issue, we propose a novel MCoT learning strategy termed Sufficient-Component Cause Model (SCCM) learning. This approach encourages the MCoT to generate sufficient yet minimal visual components that are independently capable of leading to correct answers. We note that the proposed SCCM is annotation-free and compatible with various RFT methods for MCoT in a plug-and-play manner. Empirical results demonstrate that SCCM consistently improves visual faithfulness across a suite of fine-grained perception and reasoning benchmarks. Code is available at https://github.com/EugeneLiu01/Faithful_Thinking_with_Image.
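The intervention probe described in the abstract can be sketched in a few lines: perturb one modality of the MCoT trace at a time and count how often the final answer flips. This is a minimal illustrative sketch, not the paper's implementation; `model_answer`, `intervene`, and the trace format are all hypothetical stand-ins for an actual LVLM call and perturbation scheme.

```python
# Illustrative sketch of an intervention-based faithfulness probe:
# a low flip rate under visual intervention but a high flip rate under
# textual intervention suggests the visual evidence is being ignored.
from typing import Callable, List, Tuple

def flip_rate(
    model_answer: Callable[[str, str], str],  # (visual_thought, textual_thought) -> answer
    traces: List[Tuple[str, str]],            # original (visual, textual) thought pairs
    intervene: Callable[[str], str],          # perturbation, e.g. masking or corruption
    target: str,                              # "visual" or "textual"
) -> float:
    """Fraction of examples whose answer changes after intervening on one modality."""
    flips = 0
    for visual, textual in traces:
        baseline = model_answer(visual, textual)
        if target == "visual":
            perturbed = model_answer(intervene(visual), textual)
        else:
            perturbed = model_answer(visual, intervene(textual))
        flips += int(perturbed != baseline)
    return flips / len(traces)
```

Under this framing, the paper's finding corresponds to `flip_rate(..., target="visual")` staying near zero while `flip_rate(..., target="textual")` is large.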
Problem

Research questions and friction points this paper is trying to address.

Evaluating visual reasoning faithfulness in multimodal chain-of-thought models
Addressing unreliable and insufficient visual information in reasoning traces
Developing annotation-free learning to enhance visual component sufficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated metric evaluates visual cue reliability and sufficiency
SCCM learning generates minimal sufficient visual components
Plug-and-play annotation-free enhancement for MCoT faithfulness