SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

📅 2025-12-04
🤖 AI Summary
Video large language models (VideoLLMs) frequently suffer from temporal hallucinations—generating temporally inconsistent or causally implausible descriptions—due to insufficient temporal awareness, leading to severe factual inaccuracies. To address this, we propose a training-free self-diagnostic contrastive decoding method, the first to explicitly target temporal hallucinations in VideoLLMs: during inference, it dynamically identifies the hallucination propensity of each output token and performs token-level correction using adaptively generated spatiotemporal negative samples. Crucially, our approach requires no fine-tuning and significantly improves both temporal and spatial fidelity of generated descriptions. Experiments demonstrate that it consistently outperforms existing training-free anti-hallucination methods across three dedicated hallucination evaluation benchmarks. Moreover, it achieves measurable gains on four general video understanding benchmarks, confirming its effectiveness, generalizability, and practical utility.

📝 Abstract
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit the rich temporal information in videos when responding to user queries. As a result, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal hallucination in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Mitigating temporal hallucination in VideoLLMs
Enhancing temporal and spatial faithfulness in video descriptions
Addressing temporal inconsistency in video understanding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-diagnostic contrastive decoding for temporal hallucinations
Training-free method adaptively enhances temporal and spatial faithfulness
Dynamic token hallucination diagnosis with adaptive contrastive decoding
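The bullets above describe token-level contrastive decoding against negative samples. As a rough illustration of the general idea (a minimal sketch of standard contrastive decoding with an adaptive plausibility constraint; the function name, the `alpha`/`beta` parameters, and the masking rule are illustrative assumptions, not SEASON's published formulation):

```python
import numpy as np

def contrastive_decode(logits_orig, logits_neg, alpha=1.0, beta=0.1):
    """One decoding step of generic contrastive decoding (illustrative sketch).

    logits_orig: logits from the model conditioned on the original video input.
    logits_neg:  logits from a negative branch (e.g., temporally perturbed input),
                 which is expected to favor hallucination-prone tokens.
    """
    # Contrast the two branches in logit space: amplify the original
    # distribution and penalize tokens the negative branch also prefers.
    contrast = (1 + alpha) * logits_orig - alpha * logits_neg

    # Adaptive plausibility constraint: restrict candidates to tokens whose
    # original probability is within a factor beta of the top token, so the
    # contrast term cannot promote implausible tokens.
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    mask = probs >= beta * probs.max()
    contrast = np.where(mask, contrast, -np.inf)

    return int(np.argmax(contrast))
```

For example, if the negative branch assigns high probability to the same top token as the original branch, the contrast term shifts the choice toward a token the original model supports but the hallucination-prone branch does not.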
Chang-Hsun Wu
Graduate Institute of Communication Engineering, National Taiwan University
Kai-Po Chang
National Taiwan University
vision-language learning
Yu-Yang Sheng
Graduate Institute of Communication Engineering, National Taiwan University
Hung-Kai Chung
Graduate Institute of Communication Engineering, National Taiwan University
Kuei-Chun Wang
Graduate Institute of Communication Engineering, National Taiwan University
Yu-Chiang Frank Wang
National Taiwan University & NVIDIA
Computer Vision · Deep Learning · Machine Learning · Artificial Intelligence