Mitigating Semantic Collapse in Partially Relevant Video Retrieval

📅 2025-10-31
🤖 AI Summary
This paper addresses semantic collapse in Partially Relevant Video Retrieval (PRVR)—a phenomenon where embeddings of distinct events within a multi-event video become overly clustered, while semantically similar segments across different videos are erroneously separated. To tackle this, the authors propose a hierarchical semantic disentanglement framework with three key components: (1) Text Correlation Preservation Learning, which enforces consistent text-video alignment by preserving the query relationships encoded by a foundation model; (2) Cross-Branch Video Alignment (CBVA), which enables fine-grained segment disentanglement across multi-scale temporal representations; and (3) an integrated optimization strategy combining contrastive learning, order-preserving token merging, adaptive CBVA, and foundation-model-derived semantic priors. Evaluated on multiple PRVR benchmarks, the approach significantly improves retrieval accuracy. It is the first to systematically mitigate semantic collapse within multi-scale temporal modeling, thereby enhancing fine-grained matching in complex, real-world scenarios.

📝 Abstract
Partially Relevant Video Retrieval (PRVR) seeks videos where only part of the content matches a text query. Existing methods treat every annotated text-video pair as a positive and all others as negatives, ignoring the rich semantic variation both within a single video and across different videos. Consequently, embeddings of both queries and their corresponding video-clip segments for distinct events within the same video collapse together, while embeddings of semantically similar queries and segments from different videos are driven apart. This limits retrieval performance when videos contain multiple, diverse events. This paper addresses the aforementioned problems, termed semantic collapse, in both the text and video embedding spaces. We first introduce Text Correlation Preservation Learning, which preserves the semantic relationships encoded by the foundation model across text queries. To address collapse in video embeddings, we propose Cross-Branch Video Alignment (CBVA), a contrastive alignment method that disentangles hierarchical video representations across temporal scales. Subsequently, we introduce order-preserving token merging and adaptive CBVA to enhance alignment by producing video segments that are internally coherent yet mutually distinctive. Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy.
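To make the cross-branch idea concrete: a contrastive alignment between two temporal-scale branches of the same videos can be sketched as a symmetric InfoNCE loss, where embeddings of the same video from the coarse and fine branches are positives and all other pairings are negatives. This is an illustrative sketch in numpy, not the paper's exact formulation; the function name, temperature value, and branch names are assumptions.

```python
import numpy as np

def cross_branch_alignment_loss(coarse, fine, temperature=0.07):
    """Symmetric InfoNCE between coarse-branch and fine-branch video
    embeddings (both of shape (B, D)). Row i of each branch comes from
    the same video, so the diagonal pairs are positives and every
    off-diagonal pair is a negative."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    coarse, fine = normalize(coarse), normalize(fine)
    logits = coarse @ fine.T / temperature  # (B, B) cosine similarities

    def xent(lg):
        # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the coarse-to-fine and fine-to-coarse directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

When the two branches agree (each video's coarse and fine embeddings point the same way), the loss is near zero; when they are mismatched across videos, it grows, which is what drives the disentanglement.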
Problem

Research questions and friction points this paper is trying to address.

Preventing semantic collapse in text-video embeddings
Disentangling hierarchical video representations across scales
Enhancing retrieval accuracy for partially relevant videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Text Correlation Preservation Learning maintains query relationships
Cross-Branch Video Alignment disentangles hierarchical video representations
Order-preserving token merging creates coherent yet distinctive segments
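The order-preserving token merging listed above can be illustrated with a simple greedy variant: repeatedly average the most similar pair of *adjacent* frame tokens, so redundant neighbors collapse into one segment token while temporal order is never violated. This is a minimal sketch of the general idea, not the paper's algorithm; the function name and stopping criterion are assumptions.

```python
import numpy as np

def order_preserving_merge(tokens, target_len):
    """Greedily merge the most cosine-similar ADJACENT pair of frame
    tokens (by averaging) until target_len tokens remain. Restricting
    merges to neighbors preserves temporal order."""
    tokens = [t.astype(float) for t in tokens]
    while len(tokens) > target_len:
        sims = [
            np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(tokens, tokens[1:])
        ]
        i = int(np.argmax(sims))  # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2.0
        tokens = tokens[:i] + [merged] + tokens[i + 2:]
    return np.stack(tokens)
```

Because only neighbors merge, the resulting segment tokens are internally coherent (built from similar consecutive frames) yet mutually distinctive, matching the property the abstract attributes to this step.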