STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

📅 2026-04-03
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Video large language models often suffer from spatiotemporal hallucinations caused by insufficient visual grounding or erroneous temporal relationships. To address this, the work proposes STEAR, a framework built around a layer-aware spatiotemporal evidence intervention mechanism, presented as the first of its kind. Specifically, STEAR restores local visual grounding during mid-layer decoding and suppresses inconsistent reasoning in later stages through patch-level temporal counterfactual perturbations. Built on a single-pass encoding architecture, the method combines token-conditioned visual evidence selection with a hierarchical intervention strategy that reflects the distinct roles of decoder layers in visual grounding versus language generation. Experiments show that STEAR significantly reduces hallucinations across multiple video large language models and challenging benchmarks, improving the factual accuracy, temporal coherence, and robustness of generated content.
๐Ÿ“ Abstract
Video Large Language Models (Video-LLMs) remain prone to spatiotemporal hallucinations, often generating visually unsupported details or incorrect temporal relations. Existing mitigation methods typically treat hallucination as a uniform decoding failure, applying globally shared correction rules. We instead observe that decoder layers contribute differently to visual grounding and later linguistic composition, indicating that intervention must be layer-aware. Based on this insight, we propose STEAR, a layer-aware spatiotemporal evidence intervention framework. STEAR identifies high-risk decoding steps and selects token-conditioned visual evidence from grounding-sensitive middle layers. It uses this shared evidence for two coupled purposes: restoring missing local grounding in middle layers, and constructing temporally perturbed patch-level counterfactuals to falsify inconsistent reasoning during late-layer decoding. Consequently, STEAR mitigates both spatial and temporal hallucinations within an efficient single-encode inference framework. Experiments across representative Video-LLM backbones and challenging benchmarks demonstrate that STEAR consistently reduces hallucinations while improving faithfulness, temporal consistency, and robustness. Our results confirm that reliable video decoding relies on intervening on precise evidence at the right layer, rather than enforcing a global penalty. The code is provided in the Supplementary Material.
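The paper's actual code is stated to be in its Supplementary Material. As a rough illustration only, the decoding-time pipeline the abstract describes (token-conditioned evidence selection, mid-layer grounding restoration, and late-layer contrast against a temporally perturbed counterfactual) might be sketched as follows. All function names, shapes, the additive injection rule, and the subtractive contrastive rule are assumptions for this toy sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_evidence(attn, patches, k=2):
    """Token-conditioned evidence selection: keep the k video patches
    the current token attends to most strongly (hypothetical scoring)."""
    idx = np.argsort(attn)[-k:]
    return patches[idx]

def mid_layer_restore(hidden, evidence, alpha=0.5):
    """Restore local grounding by additively re-injecting pooled visual
    evidence into a grounding-sensitive middle-layer state (assumed rule)."""
    return hidden + alpha * evidence.mean(axis=0)

def temporal_counterfactual(patches):
    """Patch-level temporal perturbation: reverse the temporal order to
    build a counterfactual that breaks the true temporal relations."""
    return patches[::-1]

def late_layer_contrast(logits_orig, logits_cf, beta=1.0):
    """Late-layer contrastive correction: down-weight tokens whose score
    survives the temporal counterfactual (assumed subtractive rule)."""
    return softmax(logits_orig - beta * logits_cf)

# Toy walkthrough with random stand-ins for real model states.
patches = rng.normal(size=(4, 8))   # 4 video patches, embedding dim 8
attn = rng.random(4)                # token-conditioned attention scores
hidden = rng.normal(size=8)         # middle-layer hidden state
W = rng.normal(size=(8, 10))        # toy projection to a 10-word vocab

ev = select_evidence(attn, patches)
logits = mid_layer_restore(hidden, ev) @ W
logits_cf = mid_layer_restore(hidden, temporal_counterfactual(ev)) @ W
probs = late_layer_contrast(logits, logits_cf)
```

The point of the sketch is the single-encode structure: the same selected evidence `ev` serves both the mid-layer restoration and the construction of the temporal counterfactual, so no second video encoding pass is needed.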
Problem

Research questions and friction points this paper is trying to address.

spatiotemporal hallucinations
Video Large Language Models
visual grounding
temporal consistency
decoder layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

layer-aware intervention
spatiotemporal hallucination
visual grounding
counterfactual reasoning
video large language models
🔎 Similar Papers
No similar papers found.