🤖 AI Summary
Large video-language models (LVLMs) suffer from visual hallucinations and inaccurate temporal reasoning when processing long videos, since the frame budget forces aggressive downsampling. To address this, the authors propose TimeSearch, a human-inspired hierarchical temporal search framework with two core primitives: (1) *Spotlight*, which identifies salient temporal events using a Temporal-Augmented Frame Representation (TAFR) that explicitly binds visual features to timestamps; and (2) *Reflection*, which leverages the LVLM's intrinsic temporal self-reflection to verify the identified events. Both primitives are unified in a single autoregressive LVLM, which progressively explores key events and prioritizes the search by reflection confidence. Experiments show significant improvements: +9.7 percentage points in accuracy on LVBench (41.8% → 51.5%) and +11.8% mIoU for temporal grounding on Charades-STA, indicating that TAFR enables lightweight yet effective fine-grained temporal understanding in LVLMs.
📝 Abstract
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to interpret long videos accurately. Motivated by human hierarchical temporal search strategies, we propose **TimeSearch**, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) **Spotlight** efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) **Reflection** evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses the previous state-of-the-art, improving accuracy from 41.8% to 51.5% on LVBench. Additionally, experiments on temporal grounding demonstrate that a suitable TAFR is sufficient to elicit strong temporal grounding from LVLMs in a simple yet versatile manner, improving mIoU on Charades-STA by 11.8%. The code will be released.
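The Spotlight/Reflection loop described above can be pictured as a best-first search over temporal windows, where each window's frames are timestamp-tagged (TAFR) before being shown to the model and the reflection confidence decides which window to refine next. The sketch below is an illustrative assumption, not the authors' implementation: `tafr_prompt`, `spotlight`, and `reflect` are hypothetical stand-ins for the LVLM calls, and the segment tree is a toy data structure.

```python
# Hypothetical sketch of TimeSearch's confidence-driven hierarchical search.
# `spotlight(query, prompt)` and `reflect(query, segment, answer)` stand in
# for LVLM calls; neither name comes from the paper's released code.
import heapq

def tafr_prompt(frames):
    """Temporal-Augmented Frame Representation: pair each frame feature
    with an explicit timestamp token so the model can bind vision to time."""
    return [f"<t={ts:.1f}s> {feat}" for ts, feat in frames]

def timesearch(query, root_segment, spotlight, reflect, max_steps=8):
    """Best-first search: pop the window with the lowest (1 - confidence)
    priority, query the model on its TAFR prompt, verify via reflection,
    then push finer sub-windows for further exploration."""
    heap = [(0.0, 0, root_segment)]   # (priority, tiebreak, segment)
    counter = 1
    best = (0.0, None)                # (confidence, answer)
    for _ in range(max_steps):
        if not heap:
            break
        _, _, seg = heapq.heappop(heap)
        answer = spotlight(query, tafr_prompt(seg["frames"]))
        conf = reflect(query, seg, answer)   # self-reflection score in [0, 1]
        if conf > best[0]:
            best = (conf, answer)
        for child in seg.get("children", []):  # finer temporal windows
            heapq.heappush(heap, (1.0 - conf, counter, child))
            counter += 1
    return best
```

The priority queue makes the exploration "human-like" in the sense the abstract describes: windows whose answers the model is least confident about are revisited at finer temporal granularity first, instead of sampling frames uniformly.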