Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

📅 2025-04-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the accuracy-efficiency trade-off induced by multi-scale temporal modeling in partially relevant video retrieval (PRVR), this paper proposes a prototypical video representation framework. Its core idea is a **balanced prototype mechanism** that encodes the diverse temporal contexts of a video into a fixed number of prototypes that are mutually diverse and searchable by text queries. An orthogonal objective encourages the prototypes to cover different content. Cross-modal reconstruction aligns the prototypes with textual features in a shared space, uni-modal reconstruction preserves the original video contexts, and a video mixing technique supplies weak guidance for prototype-text alignment. Evaluations on TVR, ActivityNet-Captions, and QVHighlights support the approach's effectiveness without sacrificing efficiency.

📝 Abstract

In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
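The orthogonal objective mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the loss formula, so the mean squared cosine similarity used here is an assumption chosen for illustration, and the names `normalize` and `orthogonality_penalty` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def orthogonality_penalty(prototypes):
    """Mean squared cosine similarity over all distinct prototype pairs.

    Assumed form of a diversity penalty, not the paper's exact loss.
    It is 0.0 when every pair is orthogonal and 1.0 when all are collinear.
    """
    unit = [normalize(p) for p in prototypes]
    k = len(unit)
    if k < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos
            pairs += 1
    return total / pairs


# Toy prototypes: one diverse set, one fully redundant set.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
print(orthogonality_penalty(diverse))    # 0.0
print(orthogonality_penalty(redundant))  # 1.0
```

Minimizing a penalty of this kind pushes prototypes toward encoding different content, which is the stated purpose of the orthogonal objective.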

Problem

Research questions and friction points this paper is trying to address.

Balancing accuracy and efficiency in video retrieval

Handling diverse temporal contexts without high costs

Aligning video prototypes with text queries effectively
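The cost side of this trade-off can be made concrete with some arithmetic. The sketch below assumes an exhaustive multi-scale scheme that stores one vector per contiguous clip of every length, and compares it with storing a fixed number of prototypes per video. The frame count, prototype count, feature dimension, and corpus size are hypothetical values, not figures from the paper.

```python
def num_multiscale_clips(num_frames):
    """Count every contiguous clip of every length: T + (T - 1) + ... + 1."""
    return num_frames * (num_frames + 1) // 2


def index_size_bytes(num_videos, vectors_per_video, dim, bytes_per_value=4):
    """Bytes needed to store dense float32 vectors for the whole corpus."""
    return num_videos * vectors_per_video * dim * bytes_per_value


# Hypothetical settings chosen only to illustrate the scaling.
T, K, D, N = 32, 10, 384, 1000

clips = num_multiscale_clips(T)
print(clips)                          # 528
print(index_size_bytes(N, clips, D))  # 811008000
print(index_size_bytes(N, K, D))      # 15360000
```

Under these assumptions the clip-based index grows quadratically with the number of frames, while the prototype index stays constant per video regardless of video length.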

Innovation

Methods, ideas, or system contributions that make the work stand out.

Encodes diverse video contexts into fixed prototypes

Uses cross-modal reconstruction for text-video alignment

Employs video mixing for prototype-text alignment
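Once each video is reduced to a fixed set of prototypes, text-to-video search can score a video by its best-matching prototype. The sketch below assumes cosine similarity with a maximum over prototypes, which is a common choice for partially relevant retrieval but is not confirmed by this summary as the paper's scoring rule. The function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def video_score(query, prototypes):
    """Score a video by the prototype most similar to the query (assumed rule)."""
    return max(cosine(query, p) for p in prototypes)


def rank_videos(query, index):
    """Return video ids sorted from most to least relevant."""
    scores = {vid: video_score(query, protos) for vid, protos in index.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy index: two videos with two prototypes each, in a 2-D embedding space.
index = {
    "video_a": [[1.0, 0.0], [0.0, 1.0]],
    "video_b": [[-1.0, 0.0], [-1.0, -1.0]],
}
query = [0.0, 2.0]
print(rank_videos(query, index))  # ['video_a', 'video_b']
```

Taking the maximum means a query only has to match one part of a video, which fits the partially relevant setting where a text query describes a single moment rather than the whole video.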
