VideoTG-R1: Boosting Video Temporal Grounding via Curriculum Reinforcement Learning on Reflected Boundary Annotations

📅 2025-10-27
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address training inefficiency in video temporal grounding (VTG), namely ambiguous supervision from partial annotations and the sparse, indistinguishable rewards produced by hard samples, this paper proposes a curriculum reinforcement learning framework. Methodologically: (1) a Boundary Reflection Agent automatically identifies and filters out ambiguous, partially annotated samples to improve supervision quality; (2) a Difficulty Estimation Agent, coupled with a dynamic curriculum scheduling strategy, progressively incorporates hard samples, mitigating training interference. The framework integrates multimodal large language models, reinforcement learning, boundary reflection, and difficulty-aware curriculum learning. Experiments demonstrate that, using only 10% of the training data and 21% of the computational cost, the method significantly outperforms full-data baselines on both VTG and grounded video question answering, achieving efficient and robust temporal localization.
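
A minimal sketch of step (1), assuming a hypothetical `predict_relevant_spans` callable that stands in for the MLLM-based Boundary Reflection Agent; a sample is flagged as partially annotated when query-relevant content is predicted outside the labeled interval, so it can be filtered out before RL training.

```python
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec)

def overlaps(a: Span, b: Span) -> bool:
    """True if two temporal spans intersect."""
    return max(a[0], b[0]) < min(a[1], b[1])

def is_partially_annotated(video, query: str, annotated: Span,
                           predict_relevant_spans: Callable) -> bool:
    """Flag a sample whose query also matches content outside the
    annotated interval (the source of ambiguous supervision).

    `predict_relevant_spans` is a placeholder for the paper's
    Boundary Reflection Agent: an MLLM call returning query-relevant
    timestamp spans over the whole video.
    """
    predicted: List[Span] = predict_relevant_spans(video, query)
    # Any predicted span disjoint from the annotation suggests
    # relevant content beyond the labeled interval.
    return any(not overlaps(span, annotated) for span in predicted)

def filter_samples(samples, predict_relevant_spans):
    """Keep only unambiguously annotated samples for RL training."""
    return [s for s in samples
            if not is_partially_annotated(s["video"], s["query"],
                                          s["span"], predict_relevant_spans)]
```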

📝 Abstract
Video temporal grounding (VTG) aims to locate precise segments in videos based on language queries, which is a fundamental challenge in video understanding. While recent Multimodal Large Language Models (MLLMs) have shown promise in tackling VTG through reinforcement learning (RL), they overlook the challenges arising from both the quality and difficulty of training samples. (1) Partially annotated samples. Many samples contain relevant segments beyond the annotated interval, introducing ambiguous supervision. (2) Hard-to-ground samples. Samples with poor zero-shot performance produce consistently low and indistinguishable rewards during RL training, exhibiting no clear preference among multiple outputs and thus hindering learning efficiency. To address these challenges, we propose VideoTG-R1, a novel curriculum RL framework with reflected boundary annotations, enabling data-efficient training. Specifically, we propose a Boundary Reflection Agent that utilizes MLLMs to predict query-relevant timestamps outside the annotated intervals, allowing us to identify and filter out partially annotated samples, thereby reducing ambiguity. Furthermore, we introduce a Difficulty Estimation Agent to assess the training difficulty of each sample and design a curriculum RL strategy that dynamically masks the videos of hard-to-ground samples according to the training steps, easing the training difficulty and providing clearer preference. Experiments on the VTG and grounded VideoQA tasks demonstrate the effectiveness of our method. Remarkably, with only 10% of the training samples and 21% of the computational budget, VideoTG-R1 outperforms full-data counterparts under both group relative policy optimization (GRPO) and supervised fine-tuning (SFT). The code is available at https://github.com/ldong1111/VideoTG-R1.
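
To see why hard-to-ground samples stall RL training, consider the temporal-IoU reward commonly used for VTG (an assumption here; the paper's exact reward design may differ): when every sampled rollout misses the target segment, all rewards collapse to near zero, and a group-relative method like GRPO gets no preference signal to learn from.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# A hard-to-ground sample: all rollouts miss the target, so the
# group-normalized advantages are ~0 and carry no clear preference.
gt = (40.0, 55.0)
rollouts = [(0.0, 5.0), (10.0, 12.0), (70.0, 80.0), (90.0, 95.0)]
print([temporal_iou(p, gt) for p in rollouts])  # [0.0, 0.0, 0.0, 0.0]
```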
Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguous supervision from partially annotated video segments
Mitigates sparse, indistinguishable rewards for hard-to-ground samples during RL training
Improves learning efficiency in video temporal grounding with limited data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curriculum reinforcement learning with boundary reflection agent
Difficulty estimation agent for dynamic video masking (see the sketch after this list)
Reflected boundary annotations to filter ambiguous samples
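
Below is a minimal sketch of how the dynamic video masking could be scheduled, assuming a linear schedule and the hypothetical helpers shown (the paper states only that hard samples' videos are masked according to the training step): early on, only the annotated segment of a hard sample is visible, and the visible window widens until the full video is restored.

```python
import numpy as np

def mask_ratio(step: int, total_steps: int) -> float:
    """Fraction of off-target context to hide, decaying linearly from
    1.0 (only the annotated segment visible) to 0.0 (full video)."""
    return max(0.0, 1.0 - step / total_steps)

def curriculum_mask(frames: np.ndarray, fps: float, span,
                    step: int, total_steps: int) -> np.ndarray:
    """Zero out frames far from the annotated span of a hard sample.

    `frames`: (T, H, W, C) video tensor; `span`: (start_sec, end_sec).
    The visible window widens around the annotation as training advances.
    """
    T = frames.shape[0]
    start, end = int(span[0] * fps), int(span[1] * fps)
    # Extra context allowed on each side grows as the mask ratio decays.
    slack = int((1.0 - mask_ratio(step, total_steps)) * T)
    lo, hi = max(0, start - slack), min(T, end + slack)
    masked = np.zeros_like(frames)
    masked[lo:hi] = frames[lo:hi]
    return masked
```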
Authors
Lu Dong
University of Science and Technology of China, Hefei 230027, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Haiyu Zhang
Beihang University
Han Lin
Shanghai Jiao Tong University, Shanghai 200240, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Ziang Yan
Zhejiang University, Hangzhou 310058, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Xiangyu Zeng
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Hongjie Zhang
Nanjing University; Shanghai Artificial Intelligence Laboratory
Yifei Huang
Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Yi Wang
Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Zhen-Hua Ling
University of Science and Technology of China, Hefei 230027, China
Limin Wang
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Yali Wang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China, and also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China