Relaxing Anchor-Frame Dominance for Mitigating Hallucinations in Video Large Language Models

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Video large language models tend to over-rely on a small set of anchor frames during generation, which skews temporal evidence aggregation and induces hallucinations. This work links anchor-frame dominance to hallucination-prone generation and proposes Decoder-side Temporal Rebalancing (DTR), a decoding-side mechanism that requires no additional training or auxiliary models. Using layer-selective attention calibration, DTR adjusts frame-level attention distributions in middle-to-late decoder layers so that previously under-attended frames contribute more to the response. Across multiple hallucination and video understanding benchmarks, the approach consistently improves hallucination robustness while preserving competitive comprehension performance and efficient inference.

📝 Abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capability in video understanding, yet they still suffer from hallucinations. Existing mitigation methods typically rely on training, input modification, auxiliary guidance, or additional decoding procedures, while largely overlooking a more fundamental challenge. During generation, Video-LLMs tend to over-rely on a limited portion of temporal evidence, leading to temporally imbalanced evidence aggregation across the video. To address this issue, we investigate a decoder-side phenomenon in which the model exhibits a temporally imbalanced concentration pattern. We term the frame with the highest aggregated frame-level attention mass the anchor frame. We find that this bias is largely independent of the input video and instead appears to reflect a persistent, model-specific structural or positional bias, whose over-dominance is closely associated with hallucination-prone generation. Motivated by this insight, we propose Decoder-side Temporal Rebalancing (DTR), a training-free, layer-selective inference method that rebalances temporal evidence allocation in middle-to-late decoder layers without altering visual encoding or requiring auxiliary models. DTR adaptively calibrates decoder-side visual attention to alleviate temporally imbalanced concentration and encourage under-attended frames to contribute more effectively to response generation. In this way, DTR guides the decoder to ground its outputs in temporally broader and more balanced video evidence. Extensive experiments on hallucination and video understanding benchmarks show that DTR consistently improves hallucination robustness across diverse Video-LLM families, while preserving competitive video understanding performance and high inference efficiency.
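
To make the anchor-frame definition concrete, the following minimal sketch sums one decoding step's attention over the visual tokens of each frame and picks the frame with the largest mass. The `rebalance` function is a hypothetical illustration only: the abstract does not specify DTR's calibration rule, so this sketch simply interpolates each frame's mass toward the uniform share. The names `frame_attention_mass`, `anchor_frame`, `rebalance`, `tokens_per_frame`, and `alpha` are all assumptions introduced here, as is the premise that every frame contributes the same number of visual tokens.

```python
from typing import List


def frame_attention_mass(attn: List[float], tokens_per_frame: int) -> List[float]:
    """Sum attention over the visual tokens that belong to each frame."""
    if tokens_per_frame <= 0 or len(attn) % tokens_per_frame != 0:
        raise ValueError("len(attn) must be a positive multiple of tokens_per_frame")
    return [
        sum(attn[i:i + tokens_per_frame])
        for i in range(0, len(attn), tokens_per_frame)
    ]


def anchor_frame(frame_mass: List[float]) -> int:
    """Index of the frame with the highest aggregated attention mass."""
    return max(range(len(frame_mass)), key=lambda i: frame_mass[i])


def rebalance(attn: List[float], tokens_per_frame: int, alpha: float = 0.5) -> List[float]:
    """Hypothetical rule (not the paper's): move each frame's mass toward
    the uniform share. alpha=0 leaves attention unchanged, alpha=1 gives
    every frame equal mass. Total visual attention mass is preserved."""
    mass = frame_attention_mass(attn, tokens_per_frame)
    uniform = sum(mass) / len(mass)
    out: List[float] = []
    for f, m in enumerate(mass):
        target = (1.0 - alpha) * m + alpha * uniform
        start = f * tokens_per_frame
        if m > 0:
            # Rescale tokens within the frame, keeping their relative weights.
            out.extend(a * (target / m) for a in attn[start:start + tokens_per_frame])
        else:
            # A frame with zero mass has no relative weights; spread evenly.
            out.extend([target / tokens_per_frame] * tokens_per_frame)
    return out


# Toy example: 3 frames, 2 visual tokens each; the middle frame dominates.
attn = [0.05, 0.05, 0.4, 0.3, 0.1, 0.1]
mass = frame_attention_mass(attn, tokens_per_frame=2)
print("anchor frame index:", anchor_frame(mass))
print("rebalanced mass:", frame_attention_mass(rebalance(attn, 2, alpha=0.5), 2))
```

In a real Video-LLM such a step would apply only to the visual-token slice of the attention in selected middle-to-late decoder layers, as the abstract describes; how the layers are chosen and how strongly attention is calibrated are not specified here and would need to come from the paper itself.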

Problem

Research questions and friction points this paper is trying to address.

video hallucination

temporal imbalance

anchor frame

evidence aggregation

video understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-side Temporal Rebalancing

anchor frame

temporal attention imbalance