EventHallusion: Diagnosing Event Hallucinations in Video LLMs

📅 2024-09-25
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Video large language models (LLMs) suffer from severe hallucinations in event understanding, exacerbated by linguistic priors and vision-language misalignment. To address this, we introduce EventHallusion—the first benchmark dedicated to diagnosing event-level hallucinations in video LLMs—systematically defining, quantifying, and attributing such hallucinations from dual perspectives: linguistic priors and cross-modal biases. We propose temporal contrastive decoding (TCD), a novel, fine-tuning-free inference-time method that explicitly models temporal cues to suppress hallucination. Leveraging event-driven adversarial video construction and a multidimensional evaluation framework, we validate TCD across eight open-source and two closed-source models. Results show that TCD significantly enhances the reliability of event understanding, improving accuracy by over 15% for several models. This work establishes a new paradigm for trustworthy evaluation and robust reasoning in video LLMs.

📝 Abstract
Recently, Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. Despite the remarkable content reasoning and instruction-following capabilities they demonstrate, the hallucination problem of these VideoLLMs is less explored than its counterpart in the image domain. To close this gap, we propose EventHallusion, a novel benchmark that focuses on assessing VideoLLMs' hallucination toward events, the crux of video analysis. From a hallucination attribution perspective, our EventHallusion benchmark is curated to assess a VideoLLM's susceptibility toward language priors and vision-language biases. In addition, we propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs. The proposed TCD method rectifies the model's bias toward its priors during the decoding stage by comparing the original video with a modified version in which temporal cues are disrupted. Through comprehensive evaluation of eight open-source and two closed-source VideoLLMs on the proposed EventHallusion benchmark, we observe that the open-source models suffer significantly from hallucination problems, whereas the closed-source ones perform markedly better. By further equipping open-source VideoLLMs with the proposed TCD approach, evident performance improvements are achieved across most metrics in the EventHallusion benchmark. Our codes and benchmark data are available at https://github.com/Stevetich/EventHallusion.
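The core idea of TCD, as the abstract describes it, is to contrast the model's next-token logits conditioned on the original video against logits conditioned on a temporally disrupted version, so that prior-driven predictions (which survive the disruption) are suppressed. The following is a minimal generic contrastive-decoding sketch of that idea, not the paper's exact formulation; the `alpha` weighting, the frame-shuffling choice for disrupting temporal cues, and the toy logits are all illustrative assumptions.

```python
import numpy as np

def shuffle_frames(frames, seed=None):
    """Disrupt temporal cues by randomly permuting frame order (one
    plausible disruption; the paper may use a different scheme)."""
    rng = np.random.default_rng(seed)
    return frames[rng.permutation(len(frames))]

def temporal_contrastive_logits(logits_orig, logits_disrupted, alpha=1.0):
    """Amplify evidence that depends on the original temporal order and
    subtract what the model predicts even without it (its prior)."""
    return (1.0 + alpha) * logits_orig - alpha * logits_disrupted

# Toy decoding step over a 3-token vocabulary.
logits_orig = np.array([1.0, 1.2, 0.5])  # conditioned on the original video
logits_disr = np.array([0.2, 1.2, 0.1])  # conditioned on the shuffled video

adjusted = temporal_contrastive_logits(logits_orig, logits_disr, alpha=1.0)
next_token = int(np.argmax(adjusted))
```

In this toy step, greedy decoding on the raw logits would pick token 1, which the model predicts equally strongly even when temporal order is destroyed (a prior-driven guess); the contrastive adjustment instead favors token 0, whose evidence comes from the intact temporal cues.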
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Visual Content Understanding
Bias in Language and Vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

EventHallusion
Temporal Contrastive Decoding (TCD)
Multimodal Language Model
Jiacheng Zhang
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center on Intelligent Visual Computing
Yang Jiao
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center on Intelligent Visual Computing
Shaoxiang Chen
Meituan
Jingjing Chen
Fudan University
Multimedia · Computer Vision · Machine Learning · Pattern Recognition