SEASON: Mitigating Temporal Hallucination in Video Large Language Models via Self-Diagnostic Contrastive Decoding

📅 2025-12-04
🤖 AI Summary
Video large language models (VideoLLMs) frequently suffer from temporal hallucinations—generating temporally inconsistent or causally implausible descriptions—due to insufficient temporal awareness, leading to severe factual inaccuracies. To address this, we propose a training-free self-diagnostic contrastive decoding method, the first to explicitly target temporal hallucinations in VideoLLMs: during inference, it dynamically identifies the hallucination propensity of each output token and performs token-level correction using adaptively generated spatiotemporal negative samples. Crucially, our approach requires no fine-tuning and significantly improves both temporal and spatial fidelity of generated descriptions. Experiments demonstrate that it consistently outperforms existing training-free anti-hallucination methods across three dedicated hallucination evaluation benchmarks. Moreover, it achieves measurable gains on four general video understanding benchmarks, confirming its effectiveness, generalizability, and practical utility.

📝 Abstract
Video Large Language Models (VideoLLMs) have shown remarkable progress in video understanding. However, these models still struggle to effectively perceive and exploit the rich temporal information in videos when responding to user queries. As a result, they often generate descriptions of events that are temporally inconsistent or causally implausible, causing severe hallucination issues. While most prior studies have focused on spatial hallucinations (e.g., object mismatches), temporal hallucination in video understanding remains relatively underexplored. To address this issue, we propose Self-Diagnostic Contrastive Decoding (SEASON), a training-free method that adaptively enhances temporal and spatial faithfulness for each output token. It achieves this by dynamically diagnosing each token's hallucination tendency and applying adaptive contrastive decoding against its corresponding temporal and spatial negatives. Extensive experiments demonstrate that SEASON outperforms all existing training-free hallucination mitigation approaches on three hallucination examination benchmarks, while further improving VideoLLMs across four general video understanding benchmarks. The code will be released upon acceptance.
Problem

Research questions and friction points this paper is trying to address.

Mitigating temporal hallucination in VideoLLMs
Enhancing temporal and spatial faithfulness in video descriptions
Addressing temporal inconsistency in video understanding models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-diagnostic contrastive decoding for temporal hallucinations
Training-free method adaptively enhances temporal and spatial faithfulness
Dynamic token hallucination diagnosis with adaptive contrastive decoding
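The bullets above describe token-level contrastive decoding against negative samples. As a rough illustration of the general idea (a minimal sketch of standard contrastive decoding with an adaptive plausibility constraint; the function name, the `alpha`/`beta` parameters, and the masking rule are illustrative assumptions, not SEASON's published formulation):

```python
import numpy as np

def contrastive_decode(logits_orig, logits_neg, alpha=1.0, beta=0.1):
    """One decoding step of generic contrastive decoding (illustrative sketch).

    logits_orig: logits from the model conditioned on the original video input.
    logits_neg:  logits from a negative branch (e.g., temporally perturbed input),
                 which is expected to favor hallucination-prone tokens.
    """
    # Contrast the two branches in logit space: amplify the original
    # distribution and penalize tokens the negative branch also prefers.
    contrast = (1 + alpha) * logits_orig - alpha * logits_neg

    # Adaptive plausibility constraint: restrict candidates to tokens whose
    # original probability is within a factor beta of the top token, so the
    # contrast term cannot promote implausible tokens.
    probs = np.exp(logits_orig - logits_orig.max())
    probs /= probs.sum()
    mask = probs >= beta * probs.max()
    contrast = np.where(mask, contrast, -np.inf)

    return int(np.argmax(contrast))
```

For example, if the negative branch assigns high probability to the same top token as the original branch, the contrast term shifts the choice toward a token the original model supports but the hallucination-prone branch does not.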
Chang-Hsun Wu
Graduate Institute of Communication Engineering, National Taiwan University
Kai-Po Chang
National Taiwan University
vision-language learning
Yu-Yang Sheng
Graduate Institute of Communication Engineering, National Taiwan University
Hung-Kai Chung
Graduate Institute of Communication Engineering, National Taiwan University
Kuei-Chun Wang
Graduate Institute of Communication Engineering, National Taiwan University
Yu-Chiang Frank Wang
National Taiwan University & NVIDIA
Computer Vision · Deep Learning · Machine Learning · Artificial Intelligence