VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models

📅 2026-01-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the susceptibility of Video Large Language Models to hallucinations when reasoning about event relations (such as causality, temporal order, and sub-event structure), a problem that has lacked systematic investigation. We formalize and evaluate video event relation hallucination, introducing VERHallu, the first dedicated benchmark for this problem; it spans relation classification, question answering, and counterfactual QA tasks, and features counterintuitive scenarios with human-annotated bias labels. To mitigate these hallucinations, we propose a Key-Frame Propagating (KFP) strategy that dynamically reallocates frame-level attention at intermediate layers, strengthening the model's awareness of multi-event context without slowing inference. Experiments show that this approach significantly reduces event relation hallucinations and improves accuracy on complex-event understanding.

📝 Abstract
Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.
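The KFP idea of reallocating frame-level attention so that sub-events around a key frame are not overlooked can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the single-key-frame setup, and the `alpha` sharing fraction are all illustrative assumptions; the actual method operates on attention inside intermediate transformer layers.

```python
import numpy as np

def redistribute_frame_attention(attn, key_idx, alpha=0.3):
    """Illustrative sketch (not the paper's code): move a fraction
    `alpha` of the attention mass concentrated on the key frame onto
    the remaining frames, so surrounding sub-events also receive
    attention. `attn` is a 1-D array of per-frame attention weights
    summing to 1; `key_idx` is the index of the dominant key frame."""
    attn = np.asarray(attn, dtype=float).copy()
    shared = alpha * attn[key_idx]            # mass taken from the key frame
    attn[key_idx] -= shared
    others = [i for i in range(len(attn)) if i != key_idx]
    attn[others] += shared / len(others)      # spread evenly to other frames
    return attn

# Example: attention heavily concentrated on frame 1.
weights = redistribute_frame_attention([0.1, 0.6, 0.1, 0.1, 0.1],
                                       key_idx=1, alpha=0.5)
```

The sketch preserves the total attention mass (the output still sums to 1) while flattening the distribution, which is the qualitative effect the abstract attributes to KFP.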
Problem

Research questions and friction points this paper is trying to address.

event relation hallucination
Video Large Language Models
causal relations
temporal relations
subevent relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

event relation hallucination
Video Large Language Models
VERHallu benchmark
Key-Frame Propagating
counterfactual reasoning
Zefan Zhang
College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University
Kehua Zhu
College of Software, Jilin University
Shijie Jiang
College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University
Hongyuan Lu
Facemind Group
Shengkai Sun
School of Computer Science and Information Engineering, Hefei University of Technology
Tian Bai
University of Electronic Science and Technology of China
Computer Science