Datasets and Recipes for Video Temporal Grounding via Reinforcement Learning

📅 2025-07-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Temporal Grounding (VTG) suffers from weak temporal awareness and poor cross-domain generalization. To address these challenges, we propose a two-stage training framework: (1) supervised fine-tuning (SFT) on high-quality cold-start data to strengthen temporal boundary modeling; and (2) difficulty-adaptive reinforcement learning (RL), employing progressive sample selection to enhance robustness in temporal localization. Our method integrates Large Vision-Language Models (LVLMs) with instruction tuning, significantly improving fine-grained temporal semantic understanding. Evaluated on multiple VTG benchmarks, our approach achieves state-of-the-art (SOTA) performance—particularly excelling on complex queries and open-domain scenarios. To foster community advancement, we publicly release all datasets, models, and source code.

📝 Abstract
Video Temporal Grounding (VTG) aims to localize relevant temporal segments in videos given natural language queries. Despite recent progress with large vision-language models (LVLMs) and instruction-tuning, existing approaches often suffer from limited temporal awareness and poor generalization. In this work, we introduce a two-stage training framework that integrates supervised fine-tuning with reinforcement learning (RL) to improve both the accuracy and robustness of VTG models. Our approach first leverages high-quality curated cold-start data for SFT initialization, followed by difficulty-controlled RL to further enhance temporal localization and reasoning abilities. Comprehensive experiments on multiple VTG benchmarks demonstrate that our method consistently outperforms existing models, particularly in challenging and open-domain scenarios. We conduct an in-depth analysis of training strategies and dataset curation, highlighting the importance of both high-quality cold-start data and difficulty-controlled RL. To facilitate further research and industrial adoption, we release all intermediate datasets, models, and code to the community.
Problem

Research questions and friction points this paper is trying to address.

Improving video temporal grounding accuracy and robustness
Enhancing temporal localization and reasoning abilities
Addressing limited temporal awareness and poor generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training with SFT and RL
Difficulty-controlled RL for localization
High-quality cold start data
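The two-stage recipe summarized above can be sketched as a minimal training loop. Everything in this sketch is an illustrative assumption, not the authors' released code: the function names, the linear difficulty schedule for progressive sample selection, and the placeholder reward (standing in for a temporal-IoU-style signal) are all hypothetical.

```python
def sft_stage(model, cold_start_data, epochs=1):
    """Stage 1: supervised fine-tuning on curated cold-start examples."""
    for _ in range(epochs):
        for ex in cold_start_data:
            model["updates"] += 1  # stand-in for one gradient step on a (video, query, span) example
    return model

def difficulty_controlled_rl(model, pool, num_rounds=3):
    """Stage 2: RL with progressive sample selection -- easier examples first,
    harder ones admitted as training advances (an assumed schedule)."""
    pool = sorted(pool, key=lambda ex: ex["difficulty"])
    for r in range(1, num_rounds + 1):
        cutoff = len(pool) * r // num_rounds  # widen the admitted slice each round
        for ex in pool[:cutoff]:
            reward = 1.0 - ex["difficulty"]  # placeholder; the paper would use a localization reward
            model["updates"] += 1
            model["reward_sum"] += reward
    return model

model = {"updates": 0, "reward_sum": 0.0}
cold_start = [{"query": f"q{i}"} for i in range(4)]
rl_pool = [{"query": f"p{i}", "difficulty": i / 6} for i in range(6)]

model = sft_stage(model, cold_start)
model = difficulty_controlled_rl(model, rl_pool)
print(model["updates"])  # prints 16: 4 SFT steps + (2 + 4 + 6) RL steps
```

The key design point is that the RL stage revisits easy samples while gradually admitting harder ones, rather than training on the full pool from the start.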
Ruizhe Chen
Zhejiang University
LLM, MLLM
Zhiting Fan
Zhejiang University
Tianze Luo
Bytedance
Heqing Zou
NTU
deep learning
Zhaopeng Feng
Zhejiang University
Guiyang Xie
Bytedance
Hansheng Zhang
Bytedance
Zhuochen Wang
Bytedance
Zuozhu Liu
Assistant Professor, Zhejiang University/University of Illinois Urbana-Champaign
deep learning, vision-language models, medical AI
Huaijian Zhang
Bytedance