🤖 AI Summary
Current temporal video grounding (TVG) methods tend to overfit to the temporal Intersection-over-Union (tIoU) metric, compromising semantic action understanding and robustness. To address this, we propose Invert4TVG, a collaborative optimization framework that jointly improves localization accuracy and action-semantic alignment without requiring additional data. Three inversion tasks are derived from existing TVG annotations: verb completion (predicting masked action verbs), action recognition, and video description. These are unified with TVG under a reinforcement learning paradigm with a semantics-aware reward function. The approach effectively mitigates tIoU overfitting on Charades-STA: the 3B-parameter model achieves R1@0.7 = 42.3%, outperforming Time-R1 by 7.1%, while also improving action comprehension and cross-modal semantic consistency.
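As an illustration of how an inversion example might be built from an existing TVG annotation, the sketch below masks action verbs in a query so the model must recover them from the grounded segment. The annotation fields, verb list, and mask token are assumptions for illustration; the paper's actual construction may differ.

```python
# Hypothetical Charades-STA-style annotation: (video_id, start_sec, end_sec, query)
annotation = ("3MSZA", 24.3, 30.4, "person turns a light on")

# Assumed toy verb inventory; the paper's verb extraction may differ
VERBS = {"turns", "opens", "closes", "puts", "takes", "walks"}

def make_verb_completion_example(ann, mask_token="[MASK]"):
    """Mask action verbs in the query; the model predicts them from the segment."""
    video_id, start, end, query = ann
    words = query.split()
    masked = [mask_token if w in VERBS else w for w in words]
    targets = [w for w in words if w in VERBS]
    return {
        "segment": (video_id, start, end),  # the grounded clip is the visual input
        "masked_query": " ".join(masked),
        "targets": targets,                 # verbs the model must recover
    }

ex = make_verb_completion_example(annotation)
print(ex["masked_query"])  # person [MASK] a light on
print(ex["targets"])       # ['turns']
```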
📝 Abstract
Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (tIoU), often overfit to this metric, compromising semantic understanding of the actions in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA with a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy.
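For readers unfamiliar with the evaluation metric: tIoU between a predicted segment and the ground truth is the length of their temporal overlap divided by the length of their union, and R1@0.7 is the fraction of queries whose top-1 prediction reaches tIoU ≥ 0.7. A minimal sketch with hypothetical segment values:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def r1_at(preds, gts, thresh=0.7):
    """Fraction of queries whose top-1 predicted segment reaches tIoU >= thresh."""
    hits = sum(tiou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(preds)

# Hypothetical top-1 predictions and ground-truth segments for three queries
preds = [(2.0, 8.0), (10.0, 15.0), (0.0, 4.0)]
gts   = [(2.5, 8.5), (11.0, 20.0), (0.0, 4.0)]
print(round(tiou(preds[0], gts[0]), 3))  # overlap 5.5s / union 6.5s -> 0.846
print(r1_at(preds, gts, 0.7))            # 2 of 3 queries hit -> 0.666...
```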