Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

📅 2025-04-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses open-world moment detection: precisely localizing the temporal segments of actions or events in videos given arbitrary natural language queries. We propose Grounding-MD, a grounded video-language pre-training framework that supports both zero-shot and supervised temporal localization. A structured prompt mechanism accepts an arbitrary number of open-ended queries, which removes the closed-set constraint of existing moment detection approaches. The architecture combines a Cross-Modality Fusion Encoder with a Text-Guided Fusion Decoder and is pre-trained on large-scale temporal action detection and moment retrieval datasets. Modeling the two tasks jointly is intended to improve semantic generalization and cross-task collaboration. Evaluated on four benchmarks (ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA), the framework reports new state-of-the-art results in both zero-shot and supervised settings.

📝 Abstract

Temporal Action Detection and Moment Retrieval constitute two pivotal tasks in video understanding, focusing on precisely localizing temporal segments corresponding to specific actions or events. Recent advancements introduced Moment Detection to unify these two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to facilitate comprehensive video-text alignment and enable effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions. Comprehensive evaluations across four benchmark datasets including ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA demonstrate that Grounding-MD establishes new state-of-the-art performance in zero-shot and supervised settings in open-world moment detection scenarios. All source code and trained models will be released.
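The abstract does not specify the prompt format, so the following is only a minimal sketch of how a structured prompt could carry an arbitrary number of queries for both tasks. The function name `build_structured_prompt`, the `" . "` separator, and the character-span bookkeeping are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def build_structured_prompt(
    queries: List[str], sep: str = " . "
) -> Tuple[str, List[Tuple[int, int]]]:
    """Join queries into one prompt and record each query's character span.

    Hypothetical helper: the separator and the span format are assumptions,
    not details taken from the paper.
    """
    parts: List[str] = []
    spans: List[Tuple[int, int]] = []
    cursor = 0
    for i, query in enumerate(queries):
        query = query.strip()
        if i > 0:
            cursor += len(sep)  # account for the separator before this query
        spans.append((cursor, cursor + len(query)))
        cursor += len(query)
        parts.append(query)
    return sep.join(parts), spans


# Temporal action detection: every class name becomes one query.
tad_prompt, tad_spans = build_structured_prompt(["long jump", "pole vault"])

# Moment retrieval: a single free-form sentence is the only query.
mr_prompt, mr_spans = build_structured_prompt(["a person opens the fridge"])
```

Under this assumed format, both tasks reduce to the same input: one prompt string plus a span per query, so a downstream model could attribute each predicted segment to the query it matches.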

Problem

Research questions and friction points this paper is trying to address.

Unifies temporal action detection and moment retrieval tasks

Addresses the closed-set limitation that keeps existing moment detection approaches from open-world scenarios

Enables flexible detection with open-ended natural language queries

Innovation

Methods, ideas, or system contributions that make the work stand out.

Grounded video-language pre-training for open-world moment detection

Cross-Modality Fusion Encoder and Text-Guided Fusion Decoder

Structured prompt mechanism for flexible moment detection

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs