ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video moment retrieval methods often overlook the semantic relationships among multiple candidate moments when computing vision-language similarity, rendering them susceptible to visually similar but semantically irrelevant distractors and leading to inaccurate temporal boundary prediction. To address this, this work proposes ClipTBP, a framework that explicitly models the dependency structure among matching moments through clip-pair semantic relation modeling and boundary-aware learning. ClipTBP introduces a primary-auxiliary boundary joint optimization mechanism, integrating moment-level alignment loss with a Transformer-based temporal regression module. The approach consistently improves performance across multiple baseline models and demonstrates notably enhanced robustness and generalization in boundary prediction, particularly under ambiguous query scenarios.

📝 Abstract

Video moment retrieval is the task of retrieving the segments of a video that correspond to a given text query. Recent studies have improved multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, existing models compute similarity at the snippet level and ignore the relationships between the multiple answer segments that match a single query. As a result, they are easily influenced by visually similar segments in the surrounding context and struggle to exclude segments irrelevant to the query. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss that explicitly learns the semantic relationship between answer segments. It also predicts accurate temporal boundaries by applying both a main boundary loss and an auxiliary boundary loss. ClipTBP consistently improves performance when applied to various existing models and demonstrates more robust boundary prediction even in ambiguous query scenarios.
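The abstract names three training signals (a clip-level alignment loss, a main boundary loss, and an auxiliary boundary loss) without giving their formulas. The following is a minimal sketch of how such terms might be combined, not the authors' implementation. Everything in it is an assumption made for illustration: the L1 form of the boundary losses, the cosine-based pairwise alignment term, the function names, and the weights `w_aux` and `w_align`.

```python
import math


def l1_boundary_loss(pred, target):
    """Mean absolute error over (start, end) pairs. Assumed form, not from the paper."""
    total = sum(abs(ps - ts) + abs(pe - te)
                for (ps, pe), (ts, te) in zip(pred, target))
    return total / (2 * len(pred))


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def clip_pair_alignment_loss(clip_feats, is_answer):
    """Hypothetical clip-pair term: pull pairs of answer clips together and
    push answer / non-answer pairs apart. Pairs of two non-answer clips are skipped."""
    total, count = 0.0, 0
    n = len(clip_feats)
    for i in range(n):
        for j in range(i + 1, n):
            if not (is_answer[i] or is_answer[j]):
                continue
            sim = cosine(clip_feats[i], clip_feats[j])
            if is_answer[i] and is_answer[j]:
                total += 1.0 - sim       # answer clips should agree
            else:
                total += max(0.0, sim)   # distractor should not resemble an answer clip
            count += 1
    return total / count if count else 0.0


def total_loss(main_pred, aux_pred, target, clip_feats, is_answer,
               w_aux=0.5, w_align=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (l1_boundary_loss(main_pred, target)
            + w_aux * l1_boundary_loss(aux_pred, target)
            + w_align * clip_pair_alignment_loss(clip_feats, is_answer))
```

In this sketch, a visually similar distractor clip (high cosine similarity to an answer clip) raises the alignment term, which is one plausible way to express the abstract's goal of excluding query-irrelevant segments.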

Problem

Research questions and friction points this paper is trying to address.

moment retrieval

temporal boundary prediction

multimodal alignment

visual-linguistic similarity

boundary-aware learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

clip-pair

temporal boundary prediction

boundary-aware learning