TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

📅 2025-08-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the instability in temporal prediction caused by the lack of explicit constraints on the reasoning process in reinforcement learning-based temporal video grounding (TVG), this paper introduces a timestamp anchoring mechanism: intermediate supervision points are injected into the reasoning chain to enforce stepwise refinement of temporal estimates, ensuring each reasoning step contributes meaningfully to the final prediction. The authors further propose a three-stage self-distillation training strategy (initial training via GRPO, supervised fine-tuning (SFT) on high-quality reasoning trajectories, and a final round of GRPO re-optimization) to improve anchor generation quality. Built on vision-language models such as Qwen2.5-VL-3B, the method achieves state-of-the-art performance across multiple benchmarks, producing interpretable, verifiable, and progressively refined reasoning chains that significantly improve both accuracy and reliability in long-video understanding.

📝 Abstract
Temporal Video Grounding (TVG) aims to precisely localize video segments corresponding to natural language queries, which is a critical capability for long-form video understanding. Although existing reinforcement learning approaches encourage models to generate reasoning chains before predictions, they fail to explicitly constrain the reasoning process to ensure the quality of the final temporal predictions. To address this limitation, we propose Timestamp Anchor-constrained Reasoning for Temporal Video Grounding (TAR-TVG), a novel framework that introduces timestamp anchors within the reasoning process to impose explicit supervision on the reasoning content. These anchors serve as intermediate verification points. More importantly, we require each reasoning step to produce increasingly accurate temporal estimations, thereby ensuring that the reasoning process contributes meaningfully to the final prediction. To address the challenge of low-probability anchor generation in models (e.g., Qwen2.5-VL-3B), we develop an efficient self-distillation training strategy: (1) initial GRPO training to collect 30K high-quality reasoning traces containing multiple timestamp anchors, (2) supervised fine-tuning (SFT) on distilled data, and (3) final GRPO optimization on the SFT-enhanced model. This three-stage training strategy enables robust anchor generation while maintaining reasoning quality. Experiments show that our model achieves state-of-the-art performance while producing interpretable, verifiable reasoning chains with progressively refined temporal estimations.
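The abstract's core constraint, that each timestamp anchor in the reasoning chain must be at least as accurate as the one before it, lends itself to a verifiable reward. The sketch below is an illustrative assumption, not the paper's actual reward formulation: it scores a chain of `(start, end)` anchors against a ground-truth segment by temporal IoU and zeroes out any rollout whose estimates degrade at some step.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def anchor_progress_reward(anchors, gt):
    """Hypothetical reward for a chain of timestamp anchors.

    Pays out the final anchor's IoU only if the chain refines
    monotonically, i.e. no reasoning step makes the estimate worse.
    The exact scoring rule in TAR-TVG may differ.
    """
    ious = [temporal_iou(a, gt) for a in anchors]
    monotone = all(b >= a for a, b in zip(ious, ious[1:]))
    return ious[-1] if monotone else 0.0
```

Under this scheme a rollout that coarsely brackets the segment and then tightens it step by step is rewarded, while a rollout that wanders away from an early good estimate gets nothing, which is what pushes each reasoning step to contribute to the final prediction.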
Problem

Research questions and friction points this paper is trying to address.

Improve temporal video grounding accuracy with constrained reasoning
Enforce explicit supervision through timestamp anchor verification
Address low-probability anchor generation in vision-language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timestamp anchors enforce explicit supervision during reasoning process
Self-distillation training with GRPO and SFT for robust anchors
Progressive temporal estimation refinement through constrained reasoning steps
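Stage (1) of the self-distillation strategy hinges on filtering GRPO rollouts down to the high-quality traces used for SFT. A minimal sketch of such a filter, assuming each trace carries its anchor chain and ground truth; the thresholds and trace schema are illustrative, and the paper's actual selection criteria are not reproduced here:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def collect_distillation_traces(traces, iou_threshold=0.7, min_anchors=2):
    """Keep rollouts suitable for SFT distillation.

    A trace is kept if it contains multiple timestamp anchors, the
    anchors refine monotonically, and the final estimate is accurate
    enough. Expected trace shape (hypothetical):
        {"anchors": [(s, e), ...], "gt": (s, e), "text": "..."}
    """
    kept = []
    for t in traces:
        ious = [temporal_iou(a, t["gt"]) for a in t["anchors"]]
        if (len(ious) >= min_anchors
                and all(b >= a for a, b in zip(ious, ious[1:]))
                and ious[-1] >= iou_threshold):
            kept.append(t)
    return kept
```

Fine-tuning on traces selected this way, then re-running GRPO on the SFT model, is what the summary describes as making anchor generation reliable in a base model (e.g., Qwen2.5-VL-3B) that otherwise emits anchors with low probability.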
👥 Authors
Chaohong Guo, South China University of Technology
Xun Mo, South China University of Technology
Yongwei Nie, South China University of Technology (Computer Graphics, Computer Vision)
Xuemiao Xu, South China University of Technology
Chao Xu, Nanjing Audit University
Fei Yu, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Chengjiang Long, Research Engineer/Tech Leader at ByteDance Inc. (Computer Vision, Computer Graphics, Multimedia, Machine Learning, Artificial Intelligence)