🤖 AI Summary
Video Temporal Grounding (VTG) suffers from weak temporal awareness and poor cross-domain generalization. To address these challenges, we propose a two-stage training framework: (1) supervised fine-tuning (SFT) on high-quality cold-start data to strengthen temporal boundary modeling; and (2) difficulty-adaptive reinforcement learning (RL), employing progressive sample selection to enhance robustness in temporal localization. Our method integrates Large Vision-Language Models (LVLMs) with instruction tuning, significantly improving fine-grained temporal semantic understanding. Evaluated on multiple VTG benchmarks, our approach achieves state-of-the-art (SOTA) performance—particularly excelling on complex queries and open-domain scenarios. To foster community advancement, we publicly release all datasets, models, and source code.
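The difficulty-adaptive RL stage with progressive sample selection might look like the minimal sketch below. This is an illustrative assumption, not the paper's implementation: the scoring function (here a stand-in for per-sample temporal IoU under the current policy), the difficulty bands, and the stage schedule are all hypothetical names chosen for the example.

```python
# Hypothetical sketch of difficulty-adaptive sample selection for the RL
# stage. `score_fn`, the difficulty bands, and the schedule are
# illustrative assumptions, not the authors' actual implementation.

def select_by_difficulty(samples, score_fn, lo, hi):
    """Keep samples whose current-model score (e.g., temporal IoU of a
    rollout against the ground-truth segment) falls in (lo, hi]:
    neither already solved nor hopelessly hard for the current policy."""
    return [s for s in samples if lo < score_fn(s) <= hi]

def progressive_schedule(samples, score_fn, stages):
    """Yield one training pool per RL stage, moving the difficulty band
    toward harder samples as training progresses."""
    for lo, hi in stages:
        yield select_by_difficulty(samples, score_fn, lo, hi)

# Toy usage: dict values stand in for per-sample scores under the
# current policy (higher = easier).
scores = {"a": 0.9, "b": 0.5, "c": 0.2, "d": 0.05}
pools = list(progressive_schedule(
    samples=list(scores),
    score_fn=scores.get,
    stages=[(0.3, 0.8), (0.0, 0.3)],  # medium samples first, then hard
))
print(pools)  # [['b'], ['c', 'd']]
```

In a real pipeline the scores would be recomputed between stages as the policy improves, so samples migrate between difficulty bands rather than being binned once.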
📝 Abstract
Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning (SFT) with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold-start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold-start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.