Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

📅 2026-05-26
🤖 AI Summary
This work addresses hallucination in multimodal large reasoning models on complex vision-language tasks. Existing preference optimization methods treat the chain-of-thought (CoT) and the final answer as a single response, which leaves the reasoning chain under-supervised. The paper proposes Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which treats the CoT as a condition for answer generation and contrasts how strongly different CoTs support the same preferred answer, so that reasoning quality receives its own preference signal rather than being absorbed into answer-level optimization. To build preference data, the authors use Monte Carlo Tree Search to find visually grounded, logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative samples. Experiments across multiple multimodal models and benchmarks show that RC-DPO effectively mitigates hallucination and improves the reliability of the reasoning process.
📝 Abstract
Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
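The abstract does not state the RC-DPO objective itself, so the following is only a minimal sketch of one way a reasoning-conditioned contrast could look, assuming the standard DPO log-sigmoid form. The function name `rc_dpo_style_loss`, its arguments, and the default `beta` are hypothetical and not taken from the paper. Each log-probability is assumed to be the summed token log-probability of the same preferred answer, conditioned on the image, the question, and either a positive or a negative CoT.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def rc_dpo_style_loss(
    policy_logp_pos: float,
    policy_logp_neg: float,
    ref_logp_pos: float,
    ref_logp_neg: float,
    beta: float = 0.1,
) -> float:
    """DPO-style loss contrasting one answer under two CoT conditions.

    Assumption (not from the paper): the preference is scored with the
    usual DPO log-ratio against a frozen reference model.

    policy_logp_pos / ref_logp_pos: log p(answer | image, question, positive CoT)
    policy_logp_neg / ref_logp_neg: log p(answer | image, question, negative CoT)
    """
    # How much more the policy favors the answer than the reference does,
    # under each CoT condition.
    pos_margin = policy_logp_pos - ref_logp_pos
    neg_margin = policy_logp_neg - ref_logp_neg
    # The loss shrinks as the answer becomes relatively more likely under
    # the positive CoT than under the negative CoT.
    return -log_sigmoid(beta * (pos_margin - neg_margin))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(rc_dpo_style_loss(-5.0, -5.0, -5.0, -5.0))
    # Answer is better supported by the positive CoT: lower loss.
    print(rc_dpo_style_loss(-2.0, -8.0, -5.0, -5.0))
```

Because the answer is held fixed and only the CoT condition varies, the gradient in this sketch rewards reasoning chains that make the preferred answer more likely, which is the intuition behind "answer-supportive reasoning chain alignment" in the abstract.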
Problem

Research questions and friction points this paper is trying to address.

hallucination
multimodal reasoning
Chain-of-Thought
preference optimization
vision-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-Conditioned DPO
Chain-of-Thought Alignment
Multimodal Hallucination Mitigation
Preference Optimization
Monte Carlo Tree Search