Tempo-R0: A Video-MLLM for Temporal Video Grounding through Efficient Temporal Sensing Reinforcement Learning

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Temporal Video Grounding (TVG) aims to localize the temporal segments of a video that are relevant to a natural language query. It remains challenging because videos are highly redundant and the task demands strong temporal reasoning. To address this, the paper proposes Tempo-R0, a multimodal large language model tailored for TVG, with three core components: (1) Self-adaptive Attention Allocation (SAA), which distributes the model's limited attention according to frame content variation; (2) the Explicit Timestamp-modal Aligned (ETA) method, which strengthens the model's perception of event boundaries; and (3) Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO), a reinforcement learning scheme that trains the model to refuse irrelevant video-query pairs as well as accept relevant ones. On QVHighlights and its corrected test set, the method outperforms the state of the art by around 3.5%.
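The timestamp alignment mentioned above can be pictured with a small sketch. The summary does not say how timestamps are presented to the model, so this is only one plausible reading: a textual timestamp is placed directly before each frame's placeholder in the input sequence. The function name `interleave_timestamps`, the `<frame>` placeholder, and the `<1.5s>` timestamp format are all hypothetical and not taken from the paper.

```python
from typing import Sequence


def interleave_timestamps(frame_times: Sequence[float],
                          frame_token: str = "<frame>") -> str:
    """Build an input string with an explicit timestamp before each frame.

    Assumption: both the placeholder token and the "<1.5s>" timestamp
    format are illustrative, not the paper's actual encoding.
    """
    parts = []
    for t in frame_times:
        parts.append(f"<{t:.1f}s>")  # explicit textual timestamp
        parts.append(frame_token)    # stands in for the frame's visual tokens
    return "".join(parts)


prompt = interleave_timestamps([0.0, 0.5, 1.0])
print(prompt)
```

With the timestamp sitting next to each frame, a predicted boundary such as "from 0.5s to 1.0s" refers to tokens that actually appear in the input, rather than to a position the model must infer from frame order.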

📝 Abstract

Temporal Video Grounding (TVG), which requires pinpointing the relevant temporal segments of a video given a language query, has long been a highly challenging task in video understanding. Videos typically carry more information, and more redundancy, than text or images, so a model needs a comprehensive understanding of the whole video to accurately retrieve query-relevant clips. We therefore propose Tempo-R0, a Video Multimodal Large Language Model (Video-MLLM) for temporal video grounding trained through multimodal temporal sensing reinforcement. Specifically, in the preprocessing stage of our pipeline, we employ a Self-adaptive Attention Allocation (SAA) method based on frame content variation to make efficient use of the MLLM's limited attention. We also use an Explicit Timestamp-modal Aligned (ETA) method to strengthen the model's ability to perceive event boundaries in the video. In the fine-tuning stage, we apply Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) to TVG, fostering the model's temporal reasoning by training it not only to accept relevant video-query pairs but also to refuse irrelevant ones. Experiments show that our method outperforms SOTA solutions by around 3.5% on both the original QVHighlights testbench and its corrected version with more reasonable ground-truth annotations.
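To make the idea of allocating attention by frame content variation concrete, here is a minimal sketch. It assumes, purely for illustration, that "attention" is a per-frame token budget and that variation is the mean absolute difference between consecutive frames; the abstract specifies neither, and the names `frame_variation` and `allocate_budget` are hypothetical.

```python
from typing import List, Sequence


def frame_variation(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute difference between each frame and its predecessor."""
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        scores.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return scores


def allocate_budget(frames: Sequence[Sequence[float]],
                    total_budget: int,
                    min_per_frame: int = 1) -> List[int]:
    """Split a token budget across frames in proportion to their variation.

    Assumption: every frame keeps a small floor so static frames are not
    dropped entirely; the rest is shared by variation score.
    """
    n = len(frames)
    if n == 0:
        return []
    if total_budget < n * min_per_frame:
        raise ValueError("budget too small for the per-frame floor")
    scores = frame_variation(frames)
    spare = total_budget - n * min_per_frame
    total = sum(scores)
    if total == 0:
        shares = [spare / n] * n  # static video: share evenly
    else:
        shares = [spare * s / total for s in scores]
    alloc = [min_per_frame + int(share) for share in shares]
    # Hand out tokens lost to rounding, largest fractional part first.
    leftover = total_budget - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Four flattened "frames"; only the third differs from its predecessor.
frames = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(allocate_budget(frames, total_budget=10))
```

In this toy input the only scene change happens at the third frame, so it receives all of the spare budget while the redundant frames keep just the floor.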

Problem

Research questions and friction points this paper is trying to address.

Pinpointing temporal video segments from language queries

Handling large video information volume and redundancy

Enhancing temporal reasoning with multimodal reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

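Self-adaptive Attention Allocation (SAA) for efficient use of the MLLM's limited attention

Explicit Timestamp-modal Aligned (ETA) method for perceiving event boundaries

Partial Irrelevance Refusing-based Group Relative Policy Optimization (PIR-GRPO) for accepting relevant video-query pairs and refusing irrelevant ones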
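The reinforcement learning component can be sketched as follows. The group-relative advantage (each sampled response's reward normalized by the mean and standard deviation of its group) is the standard GRPO formulation. The reward function is an assumption made for illustration only: temporal IoU when the query is relevant, and a fixed reward for refusing when it is not. The page does not give the paper's actual reward design, and the names `temporal_iou`, `reward`, and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds), start <= end


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def reward(pred: Optional[Span], gt: Optional[Span]) -> float:
    """Illustrative reward; None means 'refuse' (pred) or 'irrelevant' (gt)."""
    if gt is None:                      # irrelevant video-query pair
        return 1.0 if pred is None else 0.0
    if pred is None:                    # wrongly refused a relevant query
        return 0.0
    return temporal_iou(pred, gt)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled answers for a relevant query with span (10, 20).
gt = (10.0, 20.0)
samples = [(10.0, 20.0), (12.0, 22.0), (40.0, 50.0), None]
rewards = [reward(p, gt) for p in samples]
print(group_relative_advantages(rewards))
```

Responses that score above the group average get a positive advantage and those below get a negative one, so no separate value model is needed. Under the assumed reward, a refusal is reinforced only when the query truly does not match the video.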

🔎 Similar Papers

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models