VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the susceptibility of Video Large Language Models to hallucinations when reasoning about event relations (such as causality, temporal order, and sub-event structure), a problem that has lacked systematic investigation. We formalize and evaluate video event relation hallucination, introducing VERHallu, the first dedicated benchmark for this problem; it spans relation classification, question answering, and counterfactual QA tasks, and features counterintuitive scenarios with human-annotated bias labels. To mitigate these hallucinations, we propose a Key-Frame Propagating (KFP) strategy that dynamically reallocates frame-level attention at intermediate layers, strengthening the model's awareness of multi-event context without slowing inference. Experiments show that this approach significantly reduces event relation hallucinations and improves accuracy on complex-event understanding.

📝 Abstract
Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.
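The KFP idea of reallocating frame-level attention so that sub-events around a key frame are not overlooked can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the single-key-frame setup, and the `alpha` sharing fraction are all illustrative assumptions; the actual method operates on attention inside intermediate transformer layers.

```python
import numpy as np

def redistribute_frame_attention(attn, key_idx, alpha=0.3):
    """Illustrative sketch (not the paper's code): move a fraction
    `alpha` of the attention mass concentrated on the key frame onto
    the remaining frames, so surrounding sub-events also receive
    attention. `attn` is a 1-D array of per-frame attention weights
    summing to 1; `key_idx` is the index of the dominant key frame."""
    attn = np.asarray(attn, dtype=float).copy()
    shared = alpha * attn[key_idx]            # mass taken from the key frame
    attn[key_idx] -= shared
    others = [i for i in range(len(attn)) if i != key_idx]
    attn[others] += shared / len(others)      # spread evenly to other frames
    return attn

# Example: attention heavily concentrated on frame 1.
weights = redistribute_frame_attention([0.1, 0.6, 0.1, 0.1, 0.1],
                                       key_idx=1, alpha=0.5)
```

The sketch preserves the total attention mass (the output still sums to 1) while flattening the distribution, which is the qualitative effect the abstract attributes to KFP.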
Problem

Research questions and friction points this paper is trying to address.

event relation hallucination
Video Large Language Models
causal relations
temporal relations
subevent relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

event relation hallucination
Video Large Language Models
VERHallu benchmark
Key-Frame Propagating
counterfactual reasoning
Zefan Zhang
College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University
Kehua Zhu
College of Software, Jilin University
Shijie Jiang
College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University
Hongyuan Lu
Facemind Group
Shengkai Sun
School of Computer Science and Information Engineering, Hefei University of Technology
Tian Bai
University of Electronic Science and Technology of China
Computer Science