EventHallusion: Diagnosing Event Hallucinations in Video LLMs

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Video large language models (LLMs) suffer from severe hallucinations in event understanding, exacerbated by linguistic priors and vision-language misalignment. To address this, we introduce EventHallusion—the first benchmark dedicated to diagnosing event-level hallucinations in video LLMs—systematically defining, quantifying, and attributing such hallucinations from dual perspectives: linguistic priors and cross-modal biases. We propose temporal contrastive decoding (TCD), a novel, fine-tuning-free inference-time method that explicitly models temporal cues to suppress hallucination. Leveraging event-driven adversarial video construction and a multidimensional evaluation framework, we validate TCD across eight open-source and two closed-source models. Results show that TCD significantly enhances the reliability of event understanding, improving accuracy by over 15% for several models. This work establishes a new paradigm for trustworthy evaluation and robust reasoning in video LLMs.

📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. Despite the remarkable content reasoning and instruction-following capabilities they demonstrate, the hallucination problem of these VideoLLMs is less explored than its counterpart in the image domain. To close this gap, we propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events, the crux of video analysis. From a hallucination attribution perspective, our EventHallusion benchmark is curated to assess a VideoLLM's susceptibility toward language priors and vision-language biases. In addition, we propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs. The proposed TCD method rectifies the model's bias toward its priors during the decoding stage by comparing the original video with a modified version in which temporal cues are disrupted. Through comprehensive evaluation of eight open-source and two closed-source VideoLLMs on the proposed EventHallusion benchmark, we observe that the open-source models suffer significantly from hallucination problems, whereas the closed-source ones perform markedly better. By further equipping open-source VideoLLMs with the proposed TCD approach, evident performance improvements are achieved across most metrics in the EventHallusion benchmark. Our codes and benchmark data are available at https://github.com/Stevetich/EventHallusion.
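The core idea of TCD, as the abstract describes it, is to contrast the model's next-token logits conditioned on the original video against logits conditioned on a temporally disrupted version, so that prior-driven predictions (which survive the disruption) are suppressed. The following is a minimal generic contrastive-decoding sketch of that idea, not the paper's exact formulation; the `alpha` weighting, the frame-shuffling choice for disrupting temporal cues, and the toy logits are all illustrative assumptions.

```python
import numpy as np

def shuffle_frames(frames, seed=None):
    """Disrupt temporal cues by randomly permuting frame order (one
    plausible disruption; the paper may use a different scheme)."""
    rng = np.random.default_rng(seed)
    return frames[rng.permutation(len(frames))]

def temporal_contrastive_logits(logits_orig, logits_disrupted, alpha=1.0):
    """Amplify evidence that depends on the original temporal order and
    subtract what the model predicts even without it (its prior)."""
    return (1.0 + alpha) * logits_orig - alpha * logits_disrupted

# Toy decoding step over a 3-token vocabulary.
logits_orig = np.array([1.0, 1.2, 0.5])  # conditioned on the original video
logits_disr = np.array([0.2, 1.2, 0.1])  # conditioned on the shuffled video

adjusted = temporal_contrastive_logits(logits_orig, logits_disr, alpha=1.0)
next_token = int(np.argmax(adjusted))
```

In this toy step, greedy decoding on the raw logits would pick token 1, which the model predicts equally strongly even when temporal order is destroyed (a prior-driven guess); the contrastive adjustment instead favors token 0, whose evidence comes from the intact temporal cues.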
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Visual Content Understanding
Bias in Language and Vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

EventHallusion
Temporal Contrastive Decoding (TCD)
Multimodal Language Model
Jiacheng Zhang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center on Intelligent Visual Computing
Yang Jiao
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center on Intelligent Visual Computing
Shaoxiang Chen
Meituan
Jingjing Chen
Fudan University
Multimedia · Computer Vision · Machine Learning · Pattern Recognition