🤖 AI Summary
Existing GRPO-based temporal video grounding methods suffer from sparse reward signals and high computational overhead. To address these limitations, this work proposes the Video-OPD framework, which introduces on-policy distillation to leverage dense token-level supervision from a teacher model: a reverse KL divergence objective is applied to trajectories sampled from the current policy, keeping the training and inference distributions aligned while converting sparse, episode-level rewards into fine-grained, step-wise learning signals. Furthermore, a Teacher-Validated Disagreement Focusing (TVDF) curriculum strategy prioritizes trajectories that are both teacher-reliable and maximally informative for the student, further improving training efficiency. The proposed approach significantly improves both training efficiency and localization performance, achieving faster convergence and lower computational cost than GRPO baselines.
📝 Abstract
Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated Disagreement Focusing (TVDF), a lightweight training curriculum that iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student, thereby improving training efficiency. Empirical results demonstrate that Video-OPD consistently outperforms GRPO while achieving substantially faster convergence and lower computational cost, establishing on-policy distillation as an effective alternative to conventional reinforcement learning for TVG.
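The core objective described above, a reverse KL divergence between the student policy and a frontier teacher, evaluated token-by-token on trajectories sampled from the student itself, can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's implementation; the shapes, function name, and toy logits are assumptions for exposition.

```python
import torch
import torch.nn.functional as F

def reverse_kl_per_token(student_logits: torch.Tensor,
                         teacher_logits: torch.Tensor) -> torch.Tensor:
    """Token-level reverse KL, KL(pi_student || pi_teacher).

    Both inputs have shape (batch, seq_len, vocab); the returned tensor of
    shape (batch, seq_len) gives one supervision signal per generated token,
    instead of a single episode-level reward.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # Expectation under the *student* distribution -> mode-seeking reverse KL.
    return (p_s * (log_p_s - log_p_t)).sum(dim=-1)

# Toy example with random logits standing in for the two models' outputs
# on a trajectory sampled from the current (student) policy.
torch.manual_seed(0)
student = torch.randn(2, 5, 32)  # hypothetical (batch, tokens, vocab)
teacher = torch.randn(2, 5, 32)

kl = reverse_kl_per_token(student, teacher)  # dense per-token signal
loss = kl.mean()                             # scalar distillation loss
```

Because the expectation is taken under the student's own distribution, gradients flow through tokens the student actually generates, which is what preserves the on-policy property the abstract emphasizes.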
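The TVDF curriculum is described only at a high level: it iteratively prioritizes trajectories that are both teacher-reliable and maximally informative for the student. One hypothetical reading of that selection rule, with made-up scoring inputs, is sketched below; the scoring function and normalization are assumptions, not the paper's definition.

```python
import torch

def tvdf_select(teacher_reliability: torch.Tensor,
                student_disagreement: torch.Tensor,
                k: int) -> torch.Tensor:
    """Hypothetical TVDF-style step: keep the top-k trajectories that the
    teacher validates (high reliability) AND on which the student diverges
    most from the teacher (high per-trajectory KL), i.e. reliable *and*
    informative."""
    eps = 1e-8
    # Min-max normalize each criterion so neither dominates the product.
    r = (teacher_reliability - teacher_reliability.min()) / \
        (teacher_reliability.max() - teacher_reliability.min() + eps)
    d = (student_disagreement - student_disagreement.min()) / \
        (student_disagreement.max() - student_disagreement.min() + eps)
    score = r * d
    return torch.topk(score, k).indices

# Toy scores for four sampled trajectories (values are illustrative).
reliability = torch.tensor([0.90, 0.20, 0.80, 0.95])  # teacher-validated
disagreement = torch.tensor([0.10, 0.90, 0.70, 0.05])  # student-teacher KL
selected = tvdf_select(reliability, disagreement, k=2)
```

Trajectory 1 is informative but unreliable and trajectory 3 is reliable but uninformative, so a multiplicative score filters both out, matching the "both teacher-reliable and maximally informative" criterion in the abstract.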