🤖 AI Summary
This work addresses zero-shot moment retrieval in hour-long videos: localizing target temporal segments in unseen videos using only natural language queries, without task-specific training. Method: We propose P2S, the first fully training-free framework for this task. It employs an adaptive span generator to suppress the combinatorial explosion of candidate proposals during search, and replaces costly vision-language model (VLM)-based refinement with a lightweight query decomposition strategy. Contribution/Results: Our approach unifies zero-shot learning, multi-granularity semantic alignment, and temporal structure modeling to enable end-to-end, parameter-free inference. On the MAD benchmark, P2S sets a new state of the art on R5@0.1, outperforming supervised methods by 3.7%. To our knowledge, this is the first work to empirically validate the feasibility and effectiveness of zero-shot temporal localization on hour-scale videos.
📝 Abstract
Zero-shot Long Video Moment Retrieval (ZLVMR) is the task of identifying temporal segments in hour-long videos from a natural language query, without task-specific training. The core technical challenge of long video moment retrieval (LVMR) stems from the computational infeasibility of processing an entire lengthy video in a single pass. This limitation has made 'Search-then-Refine', in which candidate moments are first rapidly narrowed down and only those portions are analyzed in detail, the dominant paradigm for LVMR. However, existing realizations of this paradigm face severe limitations. Conventional supervised methods suffer from limited scalability and poor generalization despite their substantial resource consumption, while existing zero-shot methods face a dual challenge: (1) their heuristic strategies cause a candidate explosion in the 'search' phase, and (2) the 'refine' phase, being vulnerable to semantic discrepancy, requires high-cost VLM verification, incurring significant computational overhead. We propose **P**oint-**to**-**S**pan (P2S), a novel training-free framework that addresses both the inefficient 'search' and the costly 'refine' phase with two key innovations: an 'Adaptive Span Generator' that prevents the search-phase candidate explosion, and 'Query Decomposition', which refines candidates without relying on high-cost VLM verification. To our knowledge, P2S is the first zero-shot framework capable of temporal grounding in hour-long videos, outperforming supervised state-of-the-art methods by a significant margin (e.g., +3.7% R5@0.1 on MAD).
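To make the 'Search-then-Refine' idea concrete, below is a minimal, hypothetical sketch of what a training-free point-to-span procedure could look like. It is not the authors' implementation: all function names, the peak-expansion rule, and the sub-query scoring rule are illustrative assumptions, and the random vectors stand in for embeddings that would come from a frozen vision-language encoder such as CLIP.

```python
# Hypothetical sketch of a training-free "Search-then-Refine" pipeline
# in the spirit of P2S. Names and heuristics are assumptions, not the
# paper's actual method; embeddings are stubbed with random vectors.
import numpy as np

def cosine_sim(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def search_peak_points(frame_embs, query_emb, top_k=5):
    """'Search': score every frame against the query and keep the top-k
    peak frames (points), instead of enumerating all O(n^2) spans."""
    scores = cosine_sim(frame_embs, query_emb[None, :]).squeeze(-1)
    return np.argsort(scores)[::-1][:top_k], scores

def expand_point_to_span(scores, peak, threshold_ratio=0.8):
    """Adaptive span generation (assumed rule): grow a window around a
    peak frame while its neighbors stay above a fraction of the peak."""
    lo = hi = peak
    floor = scores[peak] * threshold_ratio
    while lo > 0 and scores[lo - 1] >= floor:
        lo -= 1
    while hi < len(scores) - 1 and scores[hi + 1] >= floor:
        hi += 1
    return lo, hi

def refine_with_subqueries(frame_embs, span, subquery_embs):
    """'Refine' via query decomposition (assumed scoring rule): a span's
    score is the mean of its best match to each sub-query embedding."""
    lo, hi = span
    sims = cosine_sim(frame_embs[lo:hi + 1], subquery_embs)  # (len, n_sub)
    return sims.max(axis=0).mean()

# Toy usage with random embeddings standing in for a frozen encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(3600, 512))   # one embedding per second of video
query = rng.normal(size=512)
subqueries = rng.normal(size=(3, 512))  # e.g., subject / action / scene parts

peaks, scores = search_peak_points(frames, query)
spans = [expand_point_to_span(scores, p) for p in peaks]
best = max(spans, key=lambda s: refine_with_subqueries(frames, s, subqueries))
print("predicted moment (seconds):", best)
```

Under these assumptions, the 'search' pass costs one similarity computation per frame and the 'refine' pass only touches the frames inside the few surviving spans, which is where a lightweight decomposition-based refinement could gain its efficiency over per-candidate VLM verification.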