Invert4TVG: A Temporal Video Grounding Framework with Inversion Tasks for Enhanced Action Understanding

📅 2025-08-10
🤖 AI Summary
Current temporal video grounding (TVG) methods overfit to the temporal Intersection-over-Union (tIoU) metric, compromising semantic action understanding and robustness. To address this, the paper proposes Invert4TVG, a collaborative optimization framework that jointly improves localization accuracy and action-semantic alignment without requiring additional data. Three inversion tasks are derived from existing TVG annotations: verb completion, action recognition, and video description. These are unified with TVG under a reinforcement learning paradigm with semantics-aware reward functions. The approach mitigates tIoU overfitting on Charades-STA: the 3B-parameter model achieves R1@0.7 of 42.3%, outperforming Time-R1 by 7.1%, while also improving action comprehension and cross-modal semantic consistency.

📝 Abstract
Temporal Video Grounding (TVG) seeks to localize video segments matching a given textual query. Current methods, while optimizing for high temporal Intersection-over-Union (IoU), often overfit to this metric, compromising semantic action understanding in the video and query, a critical factor for robust TVG. To address this, we introduce Inversion Tasks for TVG (Invert4TVG), a novel framework that enhances both localization accuracy and action understanding without additional data. Our approach leverages three inversion tasks derived from existing TVG annotations: (1) Verb Completion, predicting masked action verbs in queries from video segments; (2) Action Recognition, identifying query-described actions; and (3) Video Description, generating descriptions of video segments that explicitly embed query-relevant actions. These tasks, integrated with TVG via a reinforcement learning framework with well-designed reward functions, ensure balanced optimization of localization and semantics. Experiments show our method outperforms state-of-the-art approaches, achieving a 7.1% improvement in R1@0.7 on Charades-STA for a 3B model compared to Time-R1. By inverting TVG to derive query-related actions from segments, our approach strengthens semantic understanding, significantly raising the ceiling of localization accuracy.
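The abstract's headline numbers rest on the tIoU metric and the R1@0.7 recall it induces. A minimal sketch of both, assuming segments are (start, end) pairs in seconds (the function names and threshold handling are illustrative, not the paper's code):

```python
def t_iou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold=0.7):
    """R1@0.7: fraction of queries whose top-1 prediction reaches tIoU >= 0.7."""
    hits = sum(t_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)
```

Optimizing tIoU alone rewards boundary overlap regardless of whether the grounded segment actually depicts the queried action, which is the overfitting the paper targets.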
Problem

Research questions and friction points this paper is trying to address.

Overfitting to the temporal IoU metric compromises semantic action understanding
Localization accuracy must improve without additional training data
Optimization must balance temporal localization against semantic comprehension
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inversion tasks enhance action understanding
Reinforcement learning optimizes localization and semantics
Verb completion, action recognition, and video description derived from segments
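The balanced optimization described above can be pictured as a scalar RL reward that mixes the localization signal with the three inversion-task signals. The weights and per-task scores below are assumptions for illustration, not the paper's actual reward design:

```python
def combined_reward(tiou, verb_acc, action_acc, caption_sim,
                    w_loc=0.5, w_verb=0.2, w_action=0.15, w_cap=0.15):
    """Weighted sum of localization (tIoU) and semantic rewards, each in [0, 1].

    verb_acc:    masked-verb completion accuracy for the segment
    action_acc:  action recognition accuracy against the query
    caption_sim: similarity of the generated description to the query actions
    """
    return (w_loc * tiou + w_verb * verb_acc
            + w_action * action_acc + w_cap * caption_sim)
```

With such a mix, a policy that nails the boundaries but fails the semantic tasks earns less than one that does both, which is the intended pressure against pure tIoU overfitting.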