Understanding and Mitigating Hallucinations in Multimodal Chain-of-Thought Models

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe hallucination issues prevalent in multimodal chain-of-thought models during complex visual reasoning, whose underlying causes remain poorly understood. The study systematically identifies, for the first time, that such hallucinations primarily originate from the associative reasoning phase—referred to as “divergent thinking.” To mitigate this, the authors propose a lightweight decoding intervention mechanism that detects divergent-thinking steps and leverages multimodal attention analysis combined with generation control to precisely suppress hallucinatory outputs during decoding. This approach seamlessly integrates into existing frameworks, consistently outperforming state-of-the-art methods across multiple benchmarks and further enhancing the efficacy of other hallucination mitigation strategies.
📝 Abstract
Multimodal Chain-of-Thought (MCoT) models have demonstrated impressive capability in complex visual reasoning tasks. Unfortunately, recent studies reveal that they suffer from severe hallucination problems due to diminished visual attention during the generation process. However, visual attention decay is a well-studied problem in Large Vision-Language Models (LVLMs). Considering the fundamental differences in reasoning processes between MCoT models and traditional LVLMs, we raise a basic question: do MCoT models have unique causes of hallucination? To answer this question, we systematically investigate the hallucination patterns of MCoT models and find that fabricated text is primarily generated in associative reasoning steps, which we term divergent thinking. Leveraging these insights, we introduce a simple yet effective strategy that localizes divergent-thinking steps and intervenes in the decoding process to mitigate hallucinations. Extensive experiments show that our method outperforms existing methods by a large margin. More importantly, our proposed method can be conveniently integrated with other hallucination mitigation methods and further boost their performance. The code is publicly available at https://github.com/ASGO-MM/MCoT-hallucination.
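The paper does not publish its exact algorithm here, but the abstract's two-stage idea (localize divergent-thinking steps, then intervene in decoding) can be sketched roughly as follows. All names, the attention threshold, and the logit-penalty scheme below are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch: flag reasoning steps with low visual-attention mass
# ("divergent thinking") and down-weight non-image-grounded tokens there.
# ATTN_THRESHOLD, is_divergent_step, and intervene are assumed names.

ATTN_THRESHOLD = 0.15  # assumed cutoff on average attention paid to image tokens


def is_divergent_step(visual_attn):
    """A step is 'divergent' if its mean visual-attention mass is low."""
    return sum(visual_attn) / len(visual_attn) < ATTN_THRESHOLD


def intervene(logits, image_grounded_ids, penalty=2.0):
    """During a divergent step, penalize tokens not grounded in the image."""
    return [l if i in image_grounded_ids else l - penalty
            for i, l in enumerate(logits)]


# Toy usage: a step that barely attends to the image triggers the intervention.
attn = [0.05, 0.10, 0.08, 0.04, 0.06]      # per-token visual-attention mass
logits = [1.0, 3.0, 0.5, 2.0]              # next-token logits (toy vocabulary)
if is_divergent_step(attn):
    logits = intervene(logits, image_grounded_ids={1, 3})
```

The appeal of such a decoding-time scheme, as the abstract notes, is that it changes no model weights and so composes with other hallucination-mitigation methods.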
Problem

Research questions and friction points this paper is trying to address.

Multimodal Chain-of-Thought
hallucination
visual reasoning
divergent thinking
visual attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Chain-of-Thought
Hallucination Mitigation
Divergent Thinking
Visual Reasoning
Decoding Intervention
Ji Ma
Northwestern Polytechnical University
Artificial Intelligence
Wei Suo
School of Computer Science and Ningbo Institute, Northwestern Polytechnical University, China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, China
Peng Wang
School of Computer Science, Northwestern Polytechnical University, China
Computer Vision · Machine Learning · Artificial Intelligence
Yanning Zhang
Northwestern Polytechnical University
Computer Vision