🤖 AI Summary
Partially Relevant Video Retrieval (PRVR) aims to retrieve videos for queries that describe only a subset of a video's content. However, existing approaches predominantly rely on single-modality features and fail to fully exploit the representational power of vision-language pre-trained models. This paper presents the first systematic adaptation of CLIP to PRVR, introducing two key innovations: (1) a Prompt Pyramid structure that constructs multi-granularity semantic prompts to explicitly model hierarchical event relationships, and (2) an Ancestor–Descendant Interaction mechanism that enables dynamic cross-segment semantic alignment. The method integrates CLIP-based prompt learning, multi-granularity event encoding, and hierarchical semantic interaction, and achieves state-of-the-art performance on three standard benchmarks, significantly outperforming prior methods. The source code is publicly available.
📝 Abstract
Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with a systematic architectural adaptation of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) a Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) an Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.
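To make the pyramid idea concrete, here is a minimal sketch of one plausible way to organize multi-granularity event spans and their ancestor-descendant relations. All names and the binary-split scheme are our own illustrative assumptions, not the paper's actual implementation; in ProPy each span would additionally carry a learnable event prompt fed through CLIP.

```python
def build_pyramid(num_frames, num_levels):
    """Hypothetical pyramid: level l splits the video into 2**l
    near-equal frame spans (level 0 = the whole video)."""
    pyramid = []
    for level in range(num_levels):
        n_seg = 2 ** level
        bounds = [
            (i * num_frames // n_seg, (i + 1) * num_frames // n_seg)
            for i in range(n_seg)
        ]
        pyramid.append(bounds)
    return pyramid


def ancestor_descendant_pairs(pyramid):
    """An ancestor at a coarser level is linked to every finer-level
    span its own span fully contains; these links would gate the
    cross-event interaction."""
    pairs = []
    for la, level_a in enumerate(pyramid):
        for ld in range(la + 1, len(pyramid)):
            for ia, (sa, ea) in enumerate(level_a):
                for idx, (sd, ed) in enumerate(pyramid[ld]):
                    if sa <= sd and ed <= ea:
                        pairs.append(((la, ia), (ld, idx)))
    return pairs


# For an 8-frame video with 3 levels, the root span (0, 8) is an
# ancestor of all finer spans, and each half is an ancestor of its
# two quarters.
pyramid = build_pyramid(8, 3)
links = ancestor_descendant_pairs(pyramid)
```

Under this sketch, semantic interaction is restricted to pairs whose temporal extents nest, which is one way to realize "dynamic semantic interaction among events" along the hierarchy.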