🤖 AI Summary
This work addresses the challenge of hallucinations in video large language models (VLLMs), which often generate responses inconsistent with the visual content or factual reality. Existing decoding strategies struggle to precisely pinpoint the spatiotemporal and semantic origins of such hallucinations. To mitigate this, the authors propose a novel contrastive decoding framework that, during inference, constructs negative features by deliberately disrupting spatiotemporal coherence and semantic alignment, then contrasts them against the original video features to guide generation and suppress hallucinatory outputs. This approach introduces, for the first time, a spatiotemporal-semantic contrastive mechanism directly into the decoding process, effectively identifying and curbing fine-grained hallucinations without requiring any architectural modifications to the underlying model. Experimental results demonstrate a significant reduction in hallucination rates while preserving the model's original video understanding and reasoning capabilities.
📝 Abstract
Although Video Large Language Models perform remarkably well across tasks such as video understanding, question answering, and reasoning, they still suffer from hallucination: generating outputs that are inconsistent with explicit video content or factual evidence. Existing decoding methods for mitigating video hallucinations, although they take the spatiotemporal characteristics of videos into account, rely largely on heuristic designs. As a result, they fail to precisely capture the root causes of hallucinations and their fine-grained temporal and semantic correlations, leading to limited robustness and generalization in complex scenarios. To mitigate video hallucinations more effectively, we propose a novel decoding strategy termed Spatiotemporal-Semantic Contrastive Decoding. This strategy constructs negative features by deliberately disrupting the spatiotemporal consistency and semantic associations of video features, and suppresses hallucinations during inference by contrastively decoding against the original video features. Extensive experiments demonstrate that our method not only effectively mitigates hallucinations but also preserves the model's general video understanding and reasoning capabilities.
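The abstract describes the core mechanism only at a high level: corrupt the video features to make a "negative" view, then contrast the model's token distributions under the original and corrupted views at each decoding step. The sketch below illustrates that general contrastive-decoding pattern with NumPy; the function names, the feature layout `(frames, tokens, dim)`, and the specific corruptions (shuffling frame order for temporal disruption, shuffling patch tokens for spatial disruption) are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def build_negative_features(video_feats, rng):
    """Construct a 'negative' view by disrupting spatiotemporal coherence.

    Illustrative corruption only (not the paper's exact recipe):
    shuffle the frame order (temporal) and the patch tokens within
    each frame (spatial). video_feats: (frames, tokens, dim).
    """
    # Fancy indexing returns a new array, so the original is untouched.
    feats = video_feats[rng.permutation(video_feats.shape[0])]
    for t in range(feats.shape[0]):
        feats[t] = feats[t][rng.permutation(feats.shape[1])]
    return feats

def contrastive_logits(logits_pos, logits_neg, alpha=1.0):
    """Standard contrastive-decoding combination of next-token logits.

    Amplifies what the coherent (positive) view supports relative to
    the corrupted (negative) view; alpha controls contrast strength,
    and alpha=0 recovers ordinary decoding.
    """
    return (1 + alpha) * logits_pos - alpha * logits_neg
```

In a full pipeline, `logits_pos` and `logits_neg` would come from two forward passes of the same VLLM, conditioned on the original and corrupted video features respectively, with `contrastive_logits` applied before sampling each token.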