TimeSearch: Hierarchical Video Search with Spotlight and Reflection for Human-like Long Video Understanding

📅 2025-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large video-language models (LVLMs) suffer from visual hallucinations and inaccurate temporal reasoning on long videos, because downsampling the large number of frames discards information. To address this, the paper proposes TimeSearch, a human-inspired hierarchical temporal search framework that adds two primitives to a unified autoregressive LVLM: (1) *Spotlight*, which identifies salient temporal events via a Temporal-Augmented Frame Representation (TAFR) that explicitly binds visual features to timestamps; and (2) *Reflection*, which verifies the identified events using the LVLM's intrinsic temporal self-reflection. The search progressively explores key events, prioritizing candidates by reflection confidence. Experiments show substantial gains: +9.7 percentage points in accuracy on LVBench (41.8% → 51.5%) and +11.8% mIoU for temporal grounding on Charades-STA, indicating that a suitable TAFR is sufficient to elicit fine-grained temporal understanding in LVLMs.

📝 Abstract
Large video-language models (LVLMs) have shown remarkable performance across various video-language tasks. However, they encounter significant challenges when processing long videos because of the large number of video frames involved. Downsampling long videos in either space or time can lead to visual hallucinations, making it difficult to accurately interpret long videos. Motivated by human hierarchical temporal search strategies, we propose **TimeSearch**, a novel framework enabling LVLMs to understand long videos in a human-like manner. TimeSearch integrates two human-like primitives into a unified autoregressive LVLM: 1) **Spotlight** efficiently identifies relevant temporal events through a Temporal-Augmented Frame Representation (TAFR), explicitly binding visual features with timestamps; 2) **Reflection** evaluates the correctness of the identified events, leveraging the inherent temporal self-reflection capabilities of LVLMs. TimeSearch progressively explores key events and prioritizes temporal search based on reflection confidence. Extensive experiments on challenging long-video benchmarks confirm that TimeSearch substantially surpasses the previous state of the art, improving accuracy from 41.8% to 51.5% on LVBench. Additionally, experiments on temporal grounding demonstrate that an appropriate TAFR is sufficient to stimulate the surprising temporal grounding ability of LVLMs in a simple yet versatile manner, improving mIoU on Charades-STA by 11.8%. The code will be released.
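The confidence-prioritized, coarse-to-fine search described above can be sketched in a toy form. This is not the paper's implementation: `relevance` is a hypothetical stand-in for the LVLM's Spotlight scoring plus Reflection verification (a confidence that the queried event lies inside a segment), and the halving schedule, threshold, and minimum segment length are illustrative assumptions.

```python
import heapq

def time_search(duration, relevance, threshold=0.9, min_len=1.0):
    """Toy sketch of a TimeSearch-style hierarchical temporal search.

    `relevance(start, end)` is a hypothetical stand-in for the model's
    Spotlight + Reflection confidence; real TimeSearch queries the LVLM.
    Segments are explored best-first and split until confidence is high
    or the segment is short enough.
    """
    # heapq is a min-heap, so negate confidences to pop the best segment first.
    heap = [(-relevance(0.0, duration), 0.0, duration)]
    while heap:
        neg_conf, start, end = heapq.heappop(heap)
        conf = -neg_conf
        # Reflection-style stopping rule: accept a short or high-confidence segment.
        if end - start <= min_len or conf >= threshold:
            return start, end, conf
        # Spotlight-style refinement: split and rescore both halves.
        mid = (start + end) / 2.0
        for s, e in ((start, mid), (mid, end)):
            heapq.heappush(heap, (-relevance(s, e), s, e))
    return None

# Toy query: the event occupies 30s-32s of a 60s video; relevance is the
# fraction of the segment covered by the event (a crude stand-in score).
event = (30.0, 32.0)
rel = lambda s, e: max(0.0, min(e, event[1]) - max(s, event[0])) / (e - s)
print(time_search(60.0, rel))  # → (30.0, 31.875, 1.0)
```

The best-first queue mirrors the paper's idea of prioritizing temporal search by reflection confidence: promising segments are refined first, so most of the video is never examined at fine granularity.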
Problem

Research questions and friction points this paper is trying to address.

Enables LVLMs to understand long videos in a human-like manner
Reduces visual hallucinations in long video processing
Improves temporal grounding accuracy in video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical temporal search for long videos
Temporal-Augmented Frame Representation (TAFR) binding
Autoregressive LVLM with spotlight and reflection
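The TAFR idea listed above, binding each frame's visual features to its timestamp, can be illustrated with a minimal prompt-building sketch. This is an assumption about the general shape of such a representation, not the paper's format: `<frame_i>` is a hypothetical placeholder for the i-th frame's visual tokens, and the `[MM:SS]` textual timestamps are an illustrative choice.

```python
def tafr_prompt(frame_times, question):
    """Hypothetical sketch of a Temporal-Augmented Frame Representation:
    each sampled frame's visual tokens are preceded by a textual timestamp,
    so the LVLM can ground its answers to absolute times in the video."""
    parts = []
    for i, t in enumerate(frame_times):
        minutes, seconds = divmod(int(t), 60)
        # e.g. "[01:05] <frame_3>" for a frame sampled at 65 seconds.
        parts.append(f"[{minutes:02d}:{seconds:02d}] <frame_{i}>")
    return "\n".join(parts) + f"\nQuestion: {question}"

print(tafr_prompt([0.0, 65.0], "When does the event occur?"))
```

Interleaving explicit timestamps with frame tokens is what lets a purely autoregressive LVLM answer temporal grounding queries ("from 01:05 to 01:20") without any dedicated localization head, which is the simplicity the abstract highlights.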
Junwen Pan
ByteDance
Deep Learning · Machine Learning · Image Segmentation
Rui Zhang
ByteDance
Xin Wan
ByteDance
Yuan Zhang
ByteDance, School of Computer Science, Peking University
Ming Lu
School of Computer Science, Peking University
Qi She
ByteDance