🤖 AI Summary
Video Temporal Grounding (VTG) suffers from weak temporal awareness and poor cross-domain generalization. To address these challenges, we propose a two-stage training framework: (1) supervised fine-tuning (SFT) on high-quality cold-start data to strengthen temporal boundary modeling; and (2) difficulty-adaptive reinforcement learning (RL), employing progressive sample selection to enhance robustness in temporal localization. Our method integrates Large Vision-Language Models (LVLMs) with instruction tuning, significantly improving fine-grained temporal semantic understanding. Evaluated on multiple VTG benchmarks, our approach achieves state-of-the-art (SOTA) performance—particularly excelling on complex queries and open-domain scenarios. To foster community advancement, we publicly release all datasets, models, and source code.
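The difficulty-adaptive RL stage with progressive sample selection might look like the minimal sketch below. This is an illustrative assumption, not the paper's implementation: the scoring function (here a stand-in for per-sample temporal IoU under the current policy), the difficulty bands, and the stage schedule are all hypothetical names chosen for the example.

```python
# Hypothetical sketch of difficulty-adaptive sample selection for the RL
# stage. `score_fn`, the difficulty bands, and the schedule are
# illustrative assumptions, not the authors' actual implementation.

def select_by_difficulty(samples, score_fn, lo, hi):
    """Keep samples whose current-model score (e.g., temporal IoU of a
    rollout against the ground-truth segment) falls in (lo, hi]:
    neither already solved nor hopelessly hard for the current policy."""
    return [s for s in samples if lo < score_fn(s) <= hi]

def progressive_schedule(samples, score_fn, stages):
    """Yield one training pool per RL stage, moving the difficulty band
    toward harder samples as training progresses."""
    for lo, hi in stages:
        yield select_by_difficulty(samples, score_fn, lo, hi)

# Toy usage: dict values stand in for per-sample scores under the
# current policy (higher = easier).
scores = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.05}
pools = list(progressive_schedule(
    samples=list(scores),
    score_fn=scores.get,
    stages=[(0.3, 0.8), (0.0, 0.3)],  # medium samples first, then hard
))
print(pools)  # [['b'], ['c', 'd']]
```

In a real pipeline the scores would be recomputed between stages as the policy improves, so samples migrate between difficulty bands rather than being binned once.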
📝 Abstract
Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning (SFT) with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold-start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold-start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.