ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

📅 2025-08-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that are relevant to queries describing only a subset of the video content. However, existing approaches predominantly rely on single-modality features and fail to fully exploit the representational power of vision-language pre-trained models. This paper presents the first systematic adaptation of CLIP to PRVR, introducing two key innovations: (1) a Prompt Pyramid structure that constructs multi-granularity event prompts to explicitly model hierarchical event relationships, and (2) an Ancestor-Descendant Interaction mechanism that enables dynamic semantic interaction among events. The method integrates CLIP-based prompt learning, multi-granularity event encoding, and hierarchical semantic interaction, and achieves state-of-the-art performance on three standard benchmarks, significantly outperforming prior methods. The source code is publicly available.

📝 Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with a systematic architectural adaptation of CLIP designed specifically for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
Problem

Research questions and friction points this paper is trying to address.

Retrieving videos from queries relevant to only specific segments
Adapting CLIP for multi-granularity event semantics
Enabling dynamic semantic interaction among video events
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prompt Pyramid structure for multi-granularity event semantics
Ancestor-Descendant Interaction Mechanism for dynamic semantic exchange
Systematic architectural adaptation of CLIP for video retrieval
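To make the two innovations above concrete, here is a minimal, illustrative sketch of a prompt pyramid over pre-extracted frame features. This is not the paper's implementation (ProPy builds learnable prompts inside CLIP's layers); the function names, the equal-split segmentation, and the mean-pooling "interaction" are hypothetical simplifications used only to show the structure: level l splits the video into l segments, each summarized as an event prompt, and each prompt then exchanges information with prompts whose temporal spans contain it (ancestors) or are contained by it (descendants).

```python
import numpy as np

def build_prompt_pyramid(frame_feats, num_levels=3):
    """Build a pyramid of event prompts: level l splits the video into l
    equal segments, each summarized by mean-pooling its frame features.
    (Illustrative stand-in for ProPy's learned event prompts.)"""
    T, _ = frame_feats.shape
    pyramid = []
    for level in range(1, num_levels + 1):
        bounds = np.linspace(0, T, level + 1).astype(int)
        prompts = [frame_feats[s:e].mean(axis=0)
                   for s, e in zip(bounds[:-1], bounds[1:])]
        pyramid.append(np.stack(prompts))
    return pyramid

def ancestor_descendant_interaction(pyramid):
    """Refine each prompt by averaging it with prompts at other levels whose
    temporal spans contain it (ancestors) or lie inside it (descendants).
    ProPy uses attention for this exchange; averaging is a simplification."""
    spans = []  # temporal span of each prompt, as fractions of the video
    for level, prompts in enumerate(pyramid, start=1):
        spans.append([(i / level, (i + 1) / level) for i in range(level)])
    refined = []
    for li, prompts in enumerate(pyramid):
        new_prompts = []
        for pi, p in enumerate(prompts):
            s, e = spans[li][pi]
            related = [p]
            for lj, other in enumerate(pyramid):
                if lj == li:
                    continue
                for pj, q in enumerate(other):
                    os_, oe = spans[lj][pj]
                    # ancestor (its span contains ours) or descendant (inside ours)
                    if (os_ <= s and e <= oe) or (s <= os_ and oe <= e):
                        related.append(q)
            new_prompts.append(np.mean(related, axis=0))
        refined.append(np.stack(new_prompts))
    return refined
```

Under this sketch, the top-level prompt aggregates every finer event, while a fine-grained prompt is refined by all of its ancestors, which is the hierarchical exchange the Ancestor-Descendant Interaction Mechanism formalizes.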
Yi Pan
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yujia Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
Michael Kampffmeyer
Department of Physics and Technology, UiT The Arctic University of Norway
Xiaoguang Zhao
Tsinghua University