MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited capability in fine-grained temporal reasoning over videos, particularly in localizing events that span multiple, possibly non-contiguous time intervals. To address this, the paper proposes MUSEG, an RL-based method built on timestamp-aware multi-segment grounding: the model learns to align a query with multiple relevant video segments, promoting more comprehensive temporal reasoning. Training follows a customized reinforcement learning recipe with phased rewards that progressively guide the model toward temporally grounded reasoning. On temporal grounding and time-sensitive video question answering benchmarks, MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios.

📝 Abstract
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
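The abstract describes aligning a query with multiple relevant video segments. A common way to score such multi-segment predictions against ground truth is temporal IoU with best-match averaging; the sketch below is illustrative only (the function names and the matching scheme are assumptions, not the paper's actual reward):

```python
# Illustrative sketch, NOT the paper's actual reward: score a set of predicted
# (start, end) segments against ground-truth segments via best-match temporal IoU.

def iou(a, b):
    """Temporal IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def multi_segment_score(pred, gold):
    """For each gold segment, take the best IoU over predictions, then average.
    Returns 0.0 if either side is empty (nothing to match)."""
    if not pred or not gold:
        return 0.0
    return sum(max(iou(p, g) for p in pred) for g in gold) / len(gold)
```

For example, predicting `[(0, 5), (10, 15)]` against gold `[(0, 5), (12, 15)]` gives a perfect match on the first segment and a partial overlap on the second, so the averaged score falls between the two.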
Problem

Research questions and friction points this paper is trying to address.

Enhancing fine-grained temporal reasoning in MLLMs
Aligning queries with multiple relevant video segments
Improving effectiveness of reinforcement learning for video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timestamp-aware multi-segment grounding for alignment
Customized RL training with phased rewards
Enhanced temporal reasoning in video understanding
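The "phased rewards" idea above can be sketched as a staged schedule: reward well-formed outputs first, then fold in grounding accuracy. Everything below (names, the phase boundary, the 0.5 weighting) is a hypothetical illustration, not the paper's recipe:

```python
# Hypothetical sketch of a phased (staged) reward schedule. Early training
# steps reward only output format; later steps also require grounding quality.
# The threshold and weights are placeholder assumptions.

def phased_reward(step, format_ok, grounding_score, phase_boundary=1000):
    """step: training step; format_ok: output parses as timestamped segments;
    grounding_score: grounding accuracy in [0, 1] (e.g. a segment-IoU score)."""
    if step < phase_boundary:            # phase 1: learn the output format
        return 1.0 if format_ok else 0.0
    if not format_ok:                    # later phases still gate on format
        return 0.0
    return 0.5 + 0.5 * grounding_score   # phase 2: add grounding accuracy
```

The gating choice reflects a common pattern in RL for LLMs: a malformed output earns nothing regardless of content, so the model is progressively guided from structure to temporally grounded reasoning.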
👥 Authors
Fuwen Luo
Tsinghua University, Computer Science
Shengfeng Lou
School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University, China
Chi Chen
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Ziyue Wang
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Chenliang Li
Tongyi Lab, Alibaba Group
Weizhou Shen
Tongyi Lab, Alibaba Group
Jiyue Guo
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
Ming Yan
Tongyi Lab, Alibaba Group
Ji Zhang
Tongyi Lab, Alibaba Group
Fei Huang
Tongyi Lab, Alibaba Group
Yang Liu
Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China