Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of multimodal reasoning models to over-rely on linguistic priors and neglect visual inputs when equipped with reasoning mechanisms, thereby exacerbating hallucination. To mitigate this issue, the authors propose C3PO, a novel framework that uniquely integrates chain-of-thought compression with inductive contrastive preference learning. By compressing redundant reasoning tokens to preserve essential visual information and leveraging high-quality AI-generated feedback to construct contrastive preference signals, C3PO explicitly suppresses multimodal hallucinations. The approach establishes a theoretically grounded training paradigm and demonstrates consistent and significant hallucination reduction across multiple state-of-the-art multimodal models and benchmark datasets.
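The summary does not include code, but the contrastive preference component it describes resembles a DPO-style objective trained on pairs of a grounded response and a hallucination-induced negative. Below is a minimal sketch under that assumption; the function and variable names are illustrative, not taken from the paper:

```python
import torch.nn.functional as F

def contrastive_preference_loss(
    logp_preferred,        # log p_theta(y_win | x, image), summed over tokens
    logp_hallucinated,     # log p_theta(y_lose | x, image) for the induced negative
    ref_logp_preferred,    # same quantities under a frozen reference model
    ref_logp_hallucinated,
    beta=0.1,
):
    """DPO-style margin loss: push the policy to prefer the grounded
    response over the hallucination-induced one, measured relative to
    a frozen reference model. The pairing scheme is an assumption
    based on the summary, not the paper's exact formulation."""
    margin = beta * (
        (logp_preferred - ref_logp_preferred)
        - (logp_hallucinated - ref_logp_hallucinated)
    )
    return -F.logsigmoid(margin).mean()
```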

📝 Abstract
While multimodal reasoning models (MLRMs) have exhibited impressive capabilities, they remain prone to hallucinations, and effective solutions are still underexplored. In this paper, we experimentally analyze the cause of hallucination and propose C3PO, a training-based mitigation framework comprising Chain-of-Thought Compression and Contrastive Preference Optimization. First, we identify that introducing reasoning mechanisms exacerbates models' reliance on language priors while overlooking visual inputs, which can produce CoTs with reduced visual cues but redundant text tokens. To this end, we propose to selectively filter redundant thinking tokens for a more compact and signal-efficient CoT representation that preserves task-relevant information while suppressing noise. In addition, we observe that the quality of the reasoning trace largely determines whether hallucination emerges in subsequent responses. To leverage this insight, we introduce a reasoning-enhanced preference tuning scheme that constructs training pairs using high-quality AI feedback. We further design a multimodal hallucination-inducing mechanism that elicits models' inherent hallucination patterns via carefully crafted inducers, yielding informative negative signals for contrastive correction. We provide theoretical justification for its effectiveness and demonstrate consistent hallucination reduction across diverse MLRMs and benchmarks.
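The abstract does not specify the compression criterion, but "selectively filter redundant thinking tokens" suggests score-and-prune over the CoT. A plausible sketch, assuming each thinking token carries an importance score (e.g. cross-attention mass on visual tokens); all names here are hypothetical:

```python
import torch

def compress_cot(cot_token_ids, importance_scores, keep_ratio=0.5):
    """Keep only the highest-scoring fraction of chain-of-thought tokens,
    preserving their original reading order. The scoring signal is an
    assumption; the paper's actual criterion is not given in the abstract."""
    k = max(1, int(len(cot_token_ids) * keep_ratio))
    topk = torch.topk(importance_scores, k).indices
    keep = torch.sort(topk).values  # restore original token order
    return [cot_token_ids[i] for i in keep.tolist()]
```

Under this reading, the compressed CoT retains the tokens most tied to the visual evidence while discarding linguistic filler, which is consistent with the abstract's goal of a "signal-efficient" reasoning trace.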
Problem

Research questions and friction points this paper is trying to address.

multimodal reasoning
hallucination
Chain-of-Thought
visual grounding
model reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought Compression
Contrastive Preference Optimization
Multimodal Hallucination Mitigation
Reasoning-Enhanced Tuning
Hallucination Induction