🤖 AI Summary
Existing GRPO-based temporal video grounding methods suffer from sparse reward signals and high computational overhead. To address these limitations, this work proposes the Video-OPD framework, which introduces on-policy distillation to leverage dense token-level supervision from a teacher model: a reverse KL divergence objective is applied to trajectories sampled from the current policy, keeping the training and inference distributions aligned while converting sparse, episode-level rewards into fine-grained, step-wise learning signals. Furthermore, a Teacher-Validated Disagreement Focusing (TVDF) curriculum strategy prioritizes trajectories that are both teacher-reliable and maximally informative for the student, further improving training efficiency. The proposed approach significantly improves both training efficiency and localization performance, achieving faster convergence and lower computational cost than GRPO baselines.
📝 Abstract
Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
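The core objective described above, a reverse KL divergence between the student policy and a frontier teacher, evaluated token-by-token on trajectories sampled from the student itself, can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation; the shapes, function name, and toy logits are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level reverse KL, KL(pi_student || pi_teacher).

    Both inputs have shape (batch, seq_len, vocab); the returned tensor of
    shape (batch, seq_len) gives one supervision signal per generated token,
    instead of a single episode-level reward.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # Expectation under the *student* distribution -> mode-seeking reverse KL.
    return (p_s * (log_p_s - log_p_t)).sum(dim=-1)

# Toy example with random logits standing in for the two models' outputs
# on a trajectory sampled from the current (student) policy.
torch.manual_seed(0)
student = torch.randn(2, 5, 32)  # hypothetical (batch, tokens, vocab)
teacher = torch.randn(2, 5, 32)

kl = reverse_kl_per_token(student, teacher)  # dense per-token signal
loss = kl.mean()                             # scalar distillation loss
```

Because the expectation is taken under the student's own distribution, gradients flow through tokens the student actually generates, which is what preserves the on-policy property the abstract emphasizes.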
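The TVDF curriculum is described only at a high level: it iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student. One hypothetical reading of that selection rule, with made-up scoring inputs, is sketched below; the scoring function and normalization are assumptions, not the paper's definition.

```python
import torch

def tvdf_select(teacher_reliability: torch.Tensor,
                student_disagreement: torch.Tensor,
                k: int) -> torch.Tensor:
    """Hypothetical TVDF-style step: keep the top-k trajectories that the
    teacher validates (high reliability) AND on which the student diverges
    most from the teacher (high per-trajectory KL), i.e. reliable *and*
    informative."""
    eps = 1e-8
    # Min-max normalize each criterion so neither dominates the product.
    r = (teacher_reliability - teacher_reliability.min()) / \
        (teacher_reliability.max() - teacher_reliability.min() + eps)
    d = (student_disagreement - student_disagreement.min()) / \
        (student_disagreement.max() - student_disagreement.min() + eps)
    score = r * d
    return torch.topk(score, k).indices

# Toy scores for four sampled trajectories (values are illustrative).
reliability = torch.tensor([0.90, 0.20, 0.80, 0.95])  # teacher-validated
disagreement = torch.tensor([0.10, 0.90, 0.70, 0.05])  # student-teacher KL
selected = tvdf_select(reliability, disagreement, k=2)
```

Trajectory 1 is informative but unreliable and trajectory 3 is reliable but uninformative, so a multiplicative score filters both out, matching the "both teacher-reliable and maximally informative" criterion in the abstract.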