Uneven Event Modeling for Partially Relevant Video Retrieval

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address ambiguous event boundaries and frame-level misalignment in Partially Relevant Video Retrieval (PRVR), this paper proposes an Uneven Event Modeling (UEM) framework that treats events as variable-length units. Methodologically: (1) a Progressive-Grouped Video Segmentation (PGVS) module partitions videos dynamically, using temporal dependencies and semantic similarity between consecutive frames to produce clear event boundaries; (2) a Context-Aware Event Refinement (CAER) module refines event representations via the text's cross-attention, enhancing fine-grained text-video alignment. Evaluated on two standard PRVR benchmarks, the approach achieves state-of-the-art performance, improving retrieval of partially relevant content. The results validate modeling events as semantically coherent, boundary-precise, variable-length units, a departure from conventional fixed-length, rigidly segmented paradigms.

📝 Abstract
Given a text query, partially relevant video retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments, wherein event modeling is crucial for partitioning the video into smaller temporal events that partially correspond to the text. Previous methods typically segment videos into a fixed number of equal-length clips, resulting in ambiguous event boundaries. Additionally, they rely on mean pooling to compute event representations, inevitably introducing undesired misalignment. To address these issues, we propose an Uneven Event Modeling (UEM) framework for PRVR. We first introduce the Progressive-Grouped Video Segmentation (PGVS) module to iteratively formulate events in light of both temporal dependencies and semantic similarity between consecutive frames, enabling clear event boundaries. Furthermore, we propose the Context-Aware Event Refinement (CAER) module to refine the event representation conditioned on the text's cross-attention. This enables event representations to focus on the most relevant frames for a given text, facilitating more precise text-video alignment. Extensive experiments demonstrate that our method achieves state-of-the-art performance on two PRVR benchmarks.
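To make the PGVS idea concrete, here is a minimal sketch of segmentation by progressive grouping: consecutive frames are merged into the current event while they remain semantically similar to it, and a new event starts when similarity drops. The cosine-similarity threshold and the running-centroid rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def progressive_grouped_segmentation(frames, sim_threshold=0.8):
    """Sketch of PGVS-style grouping (assumed rule, for illustration):
    append a frame to the current event while its cosine similarity to
    the event's running centroid stays above a threshold; otherwise,
    close the event and start a new one. Yields variable-length events
    with explicit boundaries instead of fixed equal-length clips."""
    events = []
    current = [frames[0]]
    for f in frames[1:]:
        centroid = np.mean(current, axis=0)
        cos = f @ centroid / (np.linalg.norm(f) * np.linalg.norm(centroid) + 1e-8)
        if cos >= sim_threshold:
            current.append(f)          # same event: similar to centroid
        else:
            events.append(np.stack(current))
            current = [f]              # boundary: start a new event
    events.append(np.stack(current))
    return events
```

Because grouping scans consecutive frames in order, temporal dependency is respected by construction: only adjacent frames can belong to the same event.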
Problem

Research questions and friction points this paper is trying to address.

Modeling uneven events in partially relevant video retrieval
Improving event boundaries via progressive grouped segmentation
Refining event representations with text-conditioned cross-attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive-Grouped Video Segmentation for clear boundaries
Context-Aware Event Refinement via cross-attention
Uneven Event Modeling for precise text-video alignment
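The CAER contribution above replaces mean pooling with a text-conditioned weighting of an event's frames. A minimal sketch, assuming single-head dot-product cross-attention (the head count and scaling are assumptions, not the paper's exact design):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_conditioned_event_repr(event_frames, text_vec):
    """Sketch of CAER-style refinement: weight an event's frame features
    by their attention scores against the text query, so the event
    representation emphasizes the frames most relevant to the text
    instead of averaging all frames equally."""
    d = event_frames.shape[-1]
    attn = softmax(event_frames @ text_vec / np.sqrt(d))  # (num_frames,)
    return attn @ event_frames                            # weighted sum
```

Compared with mean pooling, a frame aligned with the query receives a larger weight, which is what enables the more precise text-video alignment described above.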
Sa Zhu
Institute of Information Engineering, Chinese Academy of Sciences
Huashan Chen
Institute of Information Engineering, Chinese Academy of Sciences
Cybersecurity Metrics, Biometric Authentication, VR/AR Security & Privacy
Wanqian Zhang
Institute of Information Engineering, Chinese Academy of Sciences
Jinchao Zhang
WeChat AI - Pattern Recognition Center
Deep Learning, Natural Language Processing, Machine Translation, Dialogue Systems
Zexian Yang
Institute of Information Engineering, Chinese Academy of Sciences
Xiaoshuai Hao
Beijing Academy of Artificial Intelligence (BAAI)
vision and language
Bo Li
Institute of Information Engineering, Chinese Academy of Sciences