🤖 AI Summary
Current multimodal large language models (MLLMs) show limited capability in fine-grained temporal reasoning over videos, particularly in precisely localizing events that span multiple, possibly non-contiguous time intervals. To address this, the authors propose MUSEG, a reinforcement learning (RL)-based method that introduces timestamp-aware multi-segment grounding: rather than grounding a query to a single span, the model learns to align it with multiple relevant video segments. Training follows a customized RL recipe with phased rewards that progressively guide the model toward temporally grounded reasoning, without requiring dense frame-level annotations. Evaluated on temporal grounding and time-sensitive video question answering, MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios.
📝 Abstract
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has recently been explored to address this issue, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
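To make the idea of phased rewards for multi-segment grounding concrete, here is a minimal, hypothetical sketch. All function names, phase names, and thresholds are illustrative assumptions, not the authors' actual implementation; the only grounded ingredients are standard temporal IoU and the notion of reward phases that become progressively finer-grained.

```python
# Hypothetical sketch of a phased, timestamp-aware multi-segment grounding
# reward. Names and thresholds are illustrative, not MUSEG's actual code.

def segment_iou(pred, gold):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def phased_reward(pred_segments, gold_segments, phase):
    """Match each gold segment to its best-overlapping prediction and
    average the IoUs; the phase controls how fine-grained the signal is."""
    if not pred_segments:
        return 0.0
    ious = [max(segment_iou(p, g) for p in pred_segments)
            for g in gold_segments]
    mean_iou = sum(ious) / len(ious)
    if phase == "format":   # early phase: reward well-formed timestamps
        return 1.0 if all(s < e for s, e in pred_segments) else 0.0
    if phase == "coarse":   # middle phase: binary hit above an IoU threshold
        return 1.0 if mean_iou >= 0.5 else 0.0
    return mean_iou         # final phase: dense IoU reward sharpens grounding
```

The staging reflects the sparse-to-dense intuition: a model that cannot yet emit valid timestamps is first rewarded simply for output format, then for rough hits, and only later for precise multi-segment localization.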