🤖 AI Summary
Current multimodal large language models (MLLMs) show limited capability in fine-grained temporal reasoning over videos, particularly in precisely localizing events that span multiple, possibly non-contiguous time intervals. To address this, the authors propose MUSEG, a reinforcement learning (RL)-based method that introduces timestamp-aware multi-segment grounding: rather than grounding a query to a single span, the model learns to align it with multiple relevant video segments. Training follows a customized RL recipe with phased rewards that progressively guide the model toward temporally grounded reasoning, without requiring dense frame-level annotations. Evaluated on temporal grounding and time-sensitive video question answering, MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios.
📝 Abstract
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has recently been explored to address this issue, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
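To make the idea of phased rewards for multi-segment grounding concrete, here is a minimal, hypothetical sketch. All function names, phase names, and thresholds are illustrative assumptions, not the authors' actual implementation; the only grounded ingredients are standard temporal IoU and the notion of reward phases that become progressively finer-grained.

```python
# Hypothetical sketch of a phased, timestamp-aware multi-segment grounding
# reward. Names and thresholds are illustrative, not MUSEG's actual code.

def segment_iou(pred, gold):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

def phased_reward(pred_segments, gold_segments, phase):
    """Match each gold segment to its best-overlapping prediction and
    average the IoUs; the phase controls how fine-grained the signal is."""
    if not pred_segments:
        return 0.0
    ious = [max(segment_iou(p, g) for p in pred_segments)
            for g in gold_segments]
    mean_iou = sum(ious) / len(ious)
    if phase == "format":   # early phase: reward well-formed timestamps
        return 1.0 if all(s < e for s, e in pred_segments) else 0.0
    if phase == "coarse":   # middle phase: binary hit above an IoU threshold
        return 1.0 if mean_iou >= 0.5 else 0.0
    return mean_iou         # final phase: dense IoU reward sharpens grounding
```

The staging reflects the sparse-to-dense intuition: a model that cannot yet emit valid timestamps is first rewarded simply for output format, then for rough hits, and only later for precise multi-segment localization.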