🤖 AI Summary
Temporal Video Grounding (TVG) aims to localize the temporal segments of a video that are relevant to a natural language query. The task remains challenging because videos are highly redundant and the model must reason over their full temporal extent. To address this, the authors propose Tempo-R0, a video multimodal large language model tailored for TVG with three core components: (1) a self-adaptive attention allocation method that distributes the model's limited attention according to frame content variation; (2) an explicit timestamp-modal alignment method that strengthens the model's perception of event boundaries; and (3) a Group Relative Policy Optimization (GRPO) fine-tuning scheme that rewards the model both for accepting relevant video-query pairs and for refusing irrelevant ones. On QVHighlights and its corrected test set, the method outperforms the state of the art by around 3.5%.
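As a rough illustration of the adaptive attention allocation idea in item (1), the sketch below splits a fixed visual-token budget across frames so that frames which differ more from their predecessor receive more tokens. This is only one possible reading: the scoring rule (mean absolute pixel difference), the function names `frame_variation` and `allocate_frame_budget`, and the `min_tokens` floor are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def frame_variation(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each frame and the previous one.

    Each frame is a flat sequence of pixel intensities (a stand-in for
    real image tensors). The first frame has no predecessor, so it gets
    the average score of the other frames (an arbitrary choice).
    """
    diffs = [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    first = sum(diffs) / len(diffs) if diffs else 1.0
    return [first] + diffs


def allocate_frame_budget(
    scores: Sequence[float], total_tokens: int, min_tokens: int = 1
) -> List[int]:
    """Split total_tokens across frames in proportion to their scores.

    Every frame keeps at least min_tokens so that static frames are
    down-weighted rather than dropped (hypothetical design choice).
    """
    n = len(scores)
    if total_tokens < n * min_tokens:
        raise ValueError("budget too small for the requested minimum")
    spare = total_tokens - n * min_tokens
    total_score = sum(scores)
    if total_score == 0:
        shares = [spare / n] * n  # no variation anywhere: split evenly
    else:
        shares = [spare * s / total_score for s in scores]
    budget = [min_tokens + int(share) for share in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total_tokens - sum(budget)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budget[i] += 1
    return budget
```

For example, `allocate_frame_budget(frame_variation(frames), total_tokens=256)` would concentrate most of the 256 tokens on frames around scene changes while keeping one token for each static frame.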
📝 Abstract
Temporal Video Grounding (TVG), which requires pinpointing the temporal segments of a video that are relevant to a language query, has long been a highly challenging task in video understanding. Videos typically carry more information and more redundancy than text or images, so a model must understand the whole video comprehensively to retrieve query-relevant clips accurately. We therefore propose Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) for temporal video grounding via multimodal temporal sensing reinforcement. Specifically, in the preprocessing stage of our pipeline, we employ a Self-adaptive Attention Allocation (SAA) method based on frame content variation to use the MLLM's limited attention efficiently. We also use an Explicit Timestamp-modal Aligned (ETA) method to strengthen the model's ability to perceive event boundaries in the video. In the fine-tuning stage, we apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) to TVG, fostering the model's temporal reasoning not only by accepting relevant video-query pairs but also by refusing irrelevant ones. Experiments show that our method outperforms state-of-the-art (SOTA) solutions by around 3.5% on both the original QVHighlights benchmark and its corrected version, which has more reasonable ground-truth annotations.
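The refusal-aware reinforcement signal described above can be sketched as follows. The abstract does not specify the reward, so the design here is an assumption: a temporal IoU reward when the query is relevant, and a binary reward for refusing when it is not. The names `temporal_iou`, `refusal_aware_reward`, and `group_relative_advantages` are hypothetical. The only part taken from standard GRPO is the normalization of each sampled response's reward against the mean and standard deviation of its group.

```python
import statistics
from typing import List, Optional, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec), assumed start <= end


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def refusal_aware_reward(pred: Optional[Span], gt: Optional[Span]) -> float:
    """Illustrative reward; None means 'refuse' (pred) or 'irrelevant' (gt).

    Assumed scheme, not the paper's: refusing an irrelevant query earns
    1.0, answering it earns 0.0, refusing a relevant query earns 0.0,
    and a grounded answer earns its temporal IoU.
    """
    if gt is None:
        return 1.0 if pred is None else 0.0
    if pred is None:
        return 0.0
    return temporal_iou(pred, gt)


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """GRPO-style advantage: reward standardized within its sampled group.

    Population standard deviation is used here; implementations differ
    on population versus sample deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of responses sampled for one video-query pair, `group_relative_advantages([refusal_aware_reward(p, gt) for p in preds])` gives positive advantages to responses that beat the group average and negative advantages to the rest. When every response earns the same reward, all advantages are zero.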