ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

📅 2026-04-30

🤖 AI Summary
Existing video moment retrieval methods often overlook the semantic relationships among the multiple candidate moments that match a query when computing vision-language similarity. This leaves them susceptible to visually similar but semantically irrelevant distractors and leads to inaccurate temporal boundary prediction. To address this, this work proposes ClipTBP, a framework that explicitly models the dependency structure among matching moments through clip-pair semantic relation modeling and boundary-aware learning. ClipTBP jointly optimizes a main and an auxiliary boundary loss, combining a clip-level alignment loss with Transformer-based temporal boundary regression. The approach consistently improves performance across multiple baseline models and yields more robust boundary prediction, particularly under ambiguous query scenarios.
📝 Abstract
Video moment retrieval is the task of retrieving the segments of a video that correspond to a given text query. Recent studies have improved multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, existing models compute similarity at the snippet level and ignore the relationships between the multiple answer segments that match a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss that explicitly learns the semantic relationships between answer segments. It also predicts accurate temporal boundaries by applying both a main boundary loss and an auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction even in ambiguous query scenarios.
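The abstract names three loss terms (a clip-level alignment loss, a main boundary loss, and an auxiliary boundary loss) without giving their exact form. The following is a minimal sketch of how such terms might be combined, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the L1-plus-IoU form of the boundary loss, the squared-error form of the clip-pair term, and the weights `aux_weight` and `align_weight`.

```python
import math


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def boundary_loss(pred, gt):
    """Assumed boundary loss: L1 distance on start/end plus (1 - IoU)."""
    l1 = abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return l1 + (1.0 - temporal_iou(pred, gt))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def clip_pair_alignment_loss(clip_embs, in_answer):
    """Assumed clip-pair term: mean squared error between the cosine
    similarity of each clip pair and a target that is 1 when both clips
    lie inside answer segments and 0 otherwise."""
    total, pairs = 0.0, 0
    for i in range(len(clip_embs)):
        for j in range(i + 1, len(clip_embs)):
            target = 1.0 if (in_answer[i] and in_answer[j]) else 0.0
            total += (cosine(clip_embs[i], clip_embs[j]) - target) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0


def total_loss(main_pred, aux_pred, gt, clip_embs, in_answer,
               aux_weight=0.5, align_weight=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (boundary_loss(main_pred, gt)
            + aux_weight * boundary_loss(aux_pred, gt)
            + align_weight * clip_pair_alignment_loss(clip_embs, in_answer))
```

In this sketch, two clips inside answer segments are pulled toward similar embeddings while pairs involving an irrelevant clip are pushed apart, which is one plausible reading of "learning the semantic relationship between answer segments". A real implementation would operate on batched tensors in a deep learning framework rather than Python lists.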
Problem

Research questions and friction points this paper is trying to address.

moment retrieval
temporal boundary prediction
multimodal alignment
visual-linguistic similarity
boundary-aware learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

clip-pair
temporal boundary prediction
boundary-aware learning
moment retrieval
multimodal alignment