🤖 AI Summary
Long-video understanding faces a critical challenge in efficient keyframe retrieval: precisely localizing the 1–5 query-relevant frames hidden among tens of thousands of video frames. To formalize this, we introduce the "Long Video Haystack" (LV-Haystack) problem. Method: We construct LV-Haystack, the first fine-grained temporal retrieval benchmark, and propose T*, a lightweight framework that pioneers the adaptation of image-level visual grounding capabilities to temporal search. T* introduces adaptive spatiotemporal zooming and sparse keyframe sampling to enable efficient, high-precision localization. Contribution/Results: On LV-Haystack, T* achieves significant gains in temporal F1, establishing new state-of-the-art performance. Under a strict 32-frame budget on the LongVideoBench XL subset, T* boosts accuracy by 2.6% for GPT-4o (50.5% → 53.1%) and 5.9% for LLaVA-OneVision-72B (56.5% → 62.4%). These results validate the effectiveness and generalizability of our cross-modal spatiotemporal grounding paradigm for long-video understanding.
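To make the "adaptive zooming" idea concrete, here is a minimal coarse-to-fine temporal search sketch. This is an illustration of the general strategy only, not the authors' T* implementation; `score_fn`, `budget_per_round`, and `rounds` are hypothetical names introduced for this example.

```python
def coarse_to_fine_search(num_frames, score_fn, budget_per_round=8, rounds=3):
    """Hypothetical coarse-to-fine keyframe search (NOT the paper's T*):
    sample frames sparsely, score their relevance to the query, then
    "zoom in" with a denser grid around the best-scoring frame."""
    lo, hi = 0, num_frames - 1
    best_idx, best_score = lo, float("-inf")
    for _ in range(rounds):
        # sparse, evenly spaced samples within the current window
        step = max(1, (hi - lo) // budget_per_round)
        for idx in range(lo, hi + 1, step):
            s = score_fn(idx)  # e.g., query-frame relevance from a VLM scorer
            if s > best_score:
                best_idx, best_score = idx, s
        # narrow the temporal window around the current best frame
        half = max(1, (hi - lo) // 4)
        lo = max(0, best_idx - half)
        hi = min(num_frames - 1, best_idx + half)
    return best_idx
```

Each round spends a fixed frame budget, so the total number of scored frames stays small even for hour-long videos.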
📝 Abstract
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are twofold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only a 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we rethink temporal search and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages the superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our PyTorch code, benchmark dataset, and models are included in the Supplementary material.
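A temporal F1 score over retrieved keyframes can be illustrated as follows. This is a generic sketch, not the benchmark's exact metric definition: the `tol` matching window and the greedy one-to-one matching are assumptions made for this example.

```python
def temporal_f1(pred_frames, gt_frames, tol=5):
    """Illustrative temporal F1 (may differ from LV-Haystack's metric):
    greedily match each predicted frame index to the nearest unmatched
    ground-truth index within `tol` frames, then compute standard F1."""
    unmatched = list(gt_frames)
    hits = 0
    for p in sorted(pred_frames):
        # nearest still-unmatched ground-truth frame, if any
        best = min(unmatched, key=lambda g: abs(g - p), default=None)
        if best is not None and abs(best - p) <= tol:
            hits += 1
            unmatched.remove(best)
    if not pred_frames or not gt_frames:
        return 0.0
    precision = hits / len(pred_frames)
    recall = hits / len(gt_frames)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because real queries have only one to five ground-truth frames among tens of thousands, precision collapses quickly when a method pads its selection with irrelevant frames, which is why frame-budgeted evaluation matters.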