MLLMs Know When Before Speaking: Revealing and Recovering Temporal Grounding via Attention Cues

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Although multimodal large language models (MLLMs) can fluently describe video content, they often predict unreliable start and end times in temporal grounding tasks. This work shows that MLLMs frequently perceive the correct interval during the prefill phase through a sparse set of cross-modal attention heads, termed Temporal Grounding Heads (TG-Heads), but that attention drifts away from this interval during generation. To close the gap, the authors propose a training-free "read-then-regenerate" inference framework: it uses the identified TG-Heads to extract the most relevant video segment, then re-invokes the model with the visual context restricted to that segment through video cropping or attention masking. The work documents a discrepancy between temporal perception and generation within MLLMs and reports consistent gains for MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three benchmarks, with improvements of up to +3.5 mIoU.
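For readers unfamiliar with the metric, the sketch below shows how temporal IoU and its mean over a dataset are conventionally computed. The function names are illustrative, and the page does not describe the benchmarks' exact evaluation scripts, so treat this as the standard definition rather than the authors' code.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

mIoU is usually reported as a percentage, so a gain of +3.5 mIoU would correspond to 3.5 percentage points under that convention.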

📝 Abstract

Video temporal grounding (VTG), which localizes the start and end times of a queried event in an untrimmed video, is a key test of whether multimodal large language models (MLLMs) understand not only what happens but also when it happens. Although modern MLLMs describe video content fluently, their timestamp predictions remain unreliable, and existing remedies either require costly post-training on temporal annotations or rely on coarse training-free heuristics. In this work, we probe the cross-modal attention of MLLMs and uncover a perception-generation gap. Our key finding is that MLLMs often know the target interval during prefill, but lose this signal when generating the final answer. In the prefill stage, a sparse set of attention heads, which we call Temporal Grounding Heads (TG-Heads), concentrates query-to-video attention on the ground-truth interval. During autoregressive decoding, however, the answer tokens shift attention away from this interval toward visually salient but query-irrelevant segments. This observation motivates an inference-time read-then-regenerate framework. We first convert TG-Head prefill attention into a debiased frame-level relevance signal and extract the high-attention interval it highlights. We then re-invoke the MLLM with visual context restricted to this interval, using video cropping or attention masking to suppress distractors. Without parameter updates or architectural changes, our framework consistently improves MiMo-VL-7B, Qwen3-VL-8B, and TimeLens-8B on three VTG benchmarks, with gains of up to +3.5 mIoU. The project website can be found at https://ddz16.github.io/mllmsknowwhen.github.io/.
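The "read" step described above can be illustrated with a minimal sketch. It assumes that query-to-video attention has already been pooled into one value per sampled frame for each selected head. The abstract does not specify how the signal is debiased or how the interval is extracted, so the baseline subtraction, the peak-expansion rule, the `ratio` threshold, and all function names here are hypothetical choices, not the authors' implementation.

```python
from typing import List, Tuple


def frame_relevance(head_attn: List[List[float]], baseline: List[float]) -> List[float]:
    """Average per-frame attention over heads, subtract a per-frame baseline
    (assumed debiasing step), clip at zero, and normalize to sum to 1."""
    n_heads = len(head_attn)
    mean = [sum(h[f] for h in head_attn) / n_heads for f in range(len(baseline))]
    debiased = [max(m - b, 0.0) for m, b in zip(mean, baseline)]
    total = sum(debiased)
    return [d / total for d in debiased] if total > 0 else debiased


def extract_interval(relevance: List[float], fps: float, ratio: float = 0.5) -> Tuple[float, float]:
    """Grow a contiguous run around the peak frame while neighbors stay at or
    above ratio * peak, then convert frame indices to seconds."""
    peak = max(range(len(relevance)), key=relevance.__getitem__)
    threshold = ratio * relevance[peak]
    lo = hi = peak
    while lo > 0 and relevance[lo - 1] >= threshold:
        lo -= 1
    while hi < len(relevance) - 1 and relevance[hi + 1] >= threshold:
        hi += 1
    return lo / fps, (hi + 1) / fps


# Toy input: two heads over eight frames sampled at 1 frame per second.
# Frame 0 carries high attention in both heads and in the baseline,
# mimicking a query-independent bias that debiasing should remove.
heads = [
    [0.30, 0.05, 0.05, 0.20, 0.25, 0.05, 0.05, 0.05],
    [0.30, 0.05, 0.05, 0.30, 0.15, 0.05, 0.05, 0.05],
]
baseline = [0.30, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
relevance = frame_relevance(heads, baseline)
start, end = extract_interval(relevance, fps=1.0)
```

In this toy input the raw attention peaks at frame 0, but after the assumed baseline subtraction the peak moves to frames 3 and 4. The resulting interval is what the "regenerate" step would use to crop the video or mask the remaining video tokens before re-invoking the model.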

Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding

Multimodal Large Language Models

Temporal Localization

Timestamp Prediction

Cross-modal Attention

Innovation

Methods, ideas, or system contributions that make the work stand out.

Temporal Grounding Heads

Video Temporal Grounding

Cross-modal Attention