Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation

📅 2025-03-25
🤖 AI Summary
This work addresses hallucination, i.e., plausible yet factually incorrect responses, in large multimodal models (LMMs) for video understanding, a problem that critically undermines their reliability. To study this issue systematically, the authors introduce HAVEN, the first multidimensional benchmark for video hallucination, comprising 6K high-quality questions organized along three axes: hallucination causes, hallucination aspects, and question formats. A comprehensive evaluation of 16 state-of-the-art LMMs identifies seven key factors that influence hallucination. The authors further propose a video-thinking model that combines supervised reasoning fine-tuning (SRFT) with direct preference optimization (TDPO) to suppress hallucination at its reasoning source. Experiments show the approach improves accuracy on hallucination evaluation by 7.65% and reduces the bias score by 4.5%. All code and data are publicly released.

📝 Abstract
The hallucination of large multimodal models (LMMs), i.e., producing responses that appear correct but are actually incorrect, limits their reliability and applicability. This paper studies the hallucination problem of LMMs in the video modality, which is dynamic and more challenging than static modalities such as images and text. Motivated by this, we first present a comprehensive benchmark, termed HAVEN, for evaluating hallucinations of LMMs in video understanding tasks. It is built along three dimensions, i.e., hallucination causes, hallucination aspects, and question formats, resulting in 6K questions. We then quantitatively study 7 factors influencing hallucination, e.g., video duration, model size, and model reasoning, via experiments with 16 LMMs on the presented benchmark. In addition, inspired by recent thinking models such as OpenAI o1, we propose a video-thinking model that mitigates the hallucinations of LMMs via supervised reasoning fine-tuning (SRFT) and direct preference optimization (TDPO), where SRFT enhances reasoning capabilities while TDPO reduces hallucinations in the thinking process. Extensive experiments and analyses demonstrate the effectiveness of this approach: it improves the baseline by 7.65% in accuracy on hallucination evaluation and reduces the bias score by 4.5%. The code and data are public at https://github.com/Hongcheng-Gao/HAVEN.
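The TDPO component described above builds on the standard direct preference optimization objective, which trains the policy to favor a preferred (low-hallucination) reasoning trace over a rejected one relative to a frozen reference model, without a separate reward model. A minimal sketch of that pairwise loss follows; the function name, inputs, and the example log-probabilities are illustrative assumptions, and the paper's exact TDPO formulation over thinking traces may differ:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Pairwise DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_* are sequence log-probabilities of the preferred (chosen) and
    dispreferred (rejected) reasoning traces under the policy being trained;
    ref_logp_* are the same quantities under the frozen reference model.
    beta controls how strongly the policy may deviate from the reference.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid(.))

# Illustrative values (assumed, not from the paper): the loss shrinks as the
# policy prefers the chosen trace more strongly than the reference does.
low = dpo_loss(-1.0, -2.0, -1.5, -1.5)    # policy favors the chosen trace
high = dpo_loss(-2.0, -1.0, -1.5, -1.5)   # policy favors the rejected trace
```

In this sketch, driving the loss down increases the policy's log-probability margin for the chosen (low-hallucination) trace over the rejected one, which is the mechanism TDPO applies to the thinking process.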
Problem

Research questions and friction points this paper is trying to address.

Study hallucination in large multimodal models for video understanding
Evaluate 7 factors affecting hallucinations via 16 LMMs on the HAVEN benchmark
Mitigate hallucinations via a video-thinking model with SRFT and TDPO
Innovation

Methods, ideas, or system contributions that make the work stand out.

HAVEN benchmark for video hallucination evaluation
Video-thinking model with SRFT and TDPO
Quantitative study of 7 hallucination factors