🤖 AI Summary
Partially Relevant Video Retrieval (PRVR) aims to retrieve videos for queries that describe only a subset of a video's content. However, existing approaches predominantly rely on single-modality features and fail to fully exploit the representational power of vision-language pre-trained models. This paper presents the first systematic adaptation of CLIP to PRVR, introducing two key innovations: (1) a Prompt Pyramid structure that constructs multi-granularity semantic prompts to explicitly model hierarchical event relationships, and (2) an Ancestor–Descendant Interaction mechanism that enables dynamic cross-segment semantic alignment. The method integrates CLIP-based prompt learning, multi-granularity event encoding, and hierarchical semantic interaction, and achieves state-of-the-art performance on three standard benchmarks, significantly outperforming prior methods. The source code is publicly available.
📝 Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with a systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
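To make the pyramid idea concrete, here is a minimal sketch of one plausible way to organize multi-granularity event spans and their ancestor-descendant relations. All names and the binary-split scheme are our own illustrative assumptions, not the paper's actual implementation; in ProPy each span would additionally carry a learnable event prompt fed through CLIP.

```python
def build_pyramid(num_frames, num_levels):
    """Hypothetical pyramid: level l splits the video into 2**l
    near-equal frame spans (level 0 = the whole video)."""
    pyramid = []
    for level in range(num_levels):
        n_seg = 2 ** level
        bounds = [
            (i * num_frames // n_seg, (i + 1) * num_frames // n_seg)
            for i in range(n_seg)
        ]
        pyramid.append(bounds)
    return pyramid


def ancestor_descendant_pairs(pyramid):
    """An ancestor at a coarser level is linked to every finer-level
    span its own span fully contains; these links would gate the
    cross-event interaction."""
    pairs = []
    for la, level_a in enumerate(pyramid):
        for ld in range(la + 1, len(pyramid)):
            for ia, (sa, ea) in enumerate(level_a):
                for idx, (sd, ed) in enumerate(pyramid[ld]):
                    if sa <= sd and ed <= ea:
                        pairs.append(((la, ia), (ld, idx)))
    return pairs


# For an 8-frame video with 3 levels, the root span (0, 8) is an
# ancestor of all finer spans, and each half is an ancestor of its
# two quarters.
pyramid = build_pyramid(8, 3)
links = ancestor_descendant_pairs(pyramid)
```

Under this sketch, semantic interaction is restricted to pairs whose temporal extents nest, which is one way to realize "dynamic semantic interaction among events" along the hierarchy.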