Re-thinking Temporal Search for Long-Form Video Understanding

📅 2025-04-03
📈 Citations: 3
Influential: 0
🤖 AI Summary
Long-video understanding faces a critical challenge in efficient keyframe retrieval: precisely localizing the 1–5 query-relevant frames among tens of thousands of video frames. To formalize this, we introduce the "Long Video Haystack" (LV-Haystack) problem. Method: We construct LV-Haystack, the first fine-grained temporal retrieval benchmark, and propose T*, a lightweight framework that adapts image-level visual grounding capabilities to temporal search. T* combines an adaptive zooming-in mechanism across spatial and temporal dimensions with sparse keyframe sampling to enable efficient, high-precision localization. Contribution/Results: On LV-Haystack, T* achieves significant gains in temporal F1, establishing new state-of-the-art performance. Under a strict 32-frame inference budget on the LongVideoBench XL subset, T* boosts accuracy by 2.6% for GPT-4o and 5.9% for LLaVA-OneVision-72B. These results validate the effectiveness and generalizability of our cross-modal spatiotemporal grounding paradigm for long-video understanding.

📝 Abstract
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.
Problem

Research questions and friction points this paper is trying to address.

Efficient understanding of long-form videos remains challenging for state-of-the-art long-context VLMs.
Existing temporal search methods cannot precisely localize the few (typically one to five) query-relevant frames among tens of thousands in real-world long videos.
Current SOTA keyframe selection methods perform poorly, achieving only a 2.1% temporal F1 score on the LVBench subset.
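The temporal F1 score above compares a predicted keyframe set against human-annotated ground-truth keyframes. The paper does not spell out the matching rule here, so the following is a minimal set-based sketch; the `tolerance` window (a prediction counts as a hit if it lands within that many frames of an unmatched ground-truth frame) is an illustrative assumption, not necessarily the benchmark's exact definition:

```python
def temporal_f1(predicted, ground_truth, tolerance=0):
    """Set-based F1 between predicted and annotated keyframe indices.

    A prediction is a hit if it falls within `tolerance` frames of some
    ground-truth keyframe; each ground-truth frame matches at most once.
    (Illustrative sketch -- the benchmark's exact matching rule may differ.)
    """
    if not predicted or not ground_truth:
        return 0.0
    remaining = list(ground_truth)
    hits = 0
    for p in predicted:
        for i, g in enumerate(remaining):
            if abs(p - g) <= tolerance:
                hits += 1
                del remaining[i]  # consume this ground-truth frame
                break
    precision = hits / len(predicted)
    recall = hits / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With exact matching (`tolerance=0`), selecting one of two annotated keyframes out of a two-frame prediction yields precision = recall = 0.5, i.e., F1 = 0.5 — which makes the reported 2.1% score a stark illustration of how rarely current methods hit any annotated frame.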
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formulates temporal search as the Long Video Haystack problem.
Proposes T*, a lightweight keyframe searching framework that casts expensive temporal search as a spatial search problem.
Introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions.
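The adaptive zooming-in idea can be sketched as a coarse-to-fine loop: sample frames sparsely, score each with an image-level grounding model, then re-sample inside the temporal neighborhoods of the best-scoring frames. Everything below is an illustrative assumption — `score_fn`, the sampling schedule, and all parameters are stand-ins, not the paper's actual T* implementation:

```python
def coarse_to_fine_search(num_frames, score_fn, budget=32, top_k=5, iters=3):
    """Hypothetical sketch of a T*-style coarse-to-fine temporal search.

    `score_fn(frame_index)` is an assumed stand-in for an image-level
    visual grounding model that scores one frame's relevance to the query.
    Each iteration samples sparsely inside the current windows, then
    "zooms in" on the neighborhoods of the best-scoring frames, so the
    total number of scored frames stays near `budget` rather than
    scaling with video length.
    """
    windows = [(0, num_frames)]          # start with the whole video
    per_iter = max(1, budget // iters)   # frames scored per round
    scores = {}                          # frame index -> relevance score
    for _ in range(iters):
        # sparse, uniform sampling inside the current windows
        samples = []
        for lo, hi in windows:
            n = max(1, per_iter // len(windows))
            step = max(1, (hi - lo) // n)
            samples.extend(range(lo, hi, step))
        for f in samples:
            scores[f] = score_fn(f)
        # shrink the windows around the current top-scoring frames
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        radius = max(1, num_frames // (4 * (len(scores) + 1)))
        windows = [(max(0, f - radius), min(num_frames, f + radius + 1))
                   for f in top]
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Under this sketch, a 10,000-frame video is never scored densely: roughly `budget` frames are scored in total, yet the returned keyframes concentrate around the relevance peak found by `score_fn` — which is the efficiency argument behind trading temporal search for repeated spatial (per-frame) grounding.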