TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses Temporal Video Grounding (TVG): localizing the temporal segment in a long video that corresponds to a natural language query. We propose TimeZero, a reasoning-guided large vision-language model (LVLM). Methodologically, we introduce the first pure reinforcement learning (PPO)-driven LVLM inference paradigm for TVG, removing the need for intermediate step annotations and enabling end-to-end video-language relational modeling without step-level supervision. To improve robustness for fine-grained localization in long videos, we decouple spatiotemporal understanding from language alignment and integrate a video frame feature pyramid with cross-modal attention. On the Charades-STA benchmark, TimeZero achieves state-of-the-art performance, significantly outperforming existing fully supervised and weakly supervised approaches. The code is publicly available.
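The summary describes training the LVLM with pure reinforcement learning, rewarding predictions against ground-truth segments rather than supervising intermediate reasoning steps. A common reward design for temporal grounding is the temporal IoU between the predicted and ground-truth interval. The sketch below illustrates that idea; the output format parsed by `grounding_reward` and both function names are assumptions for illustration, not the paper's actual implementation.

```python
import re


def temporal_iou(pred_start, pred_end, gt_start, gt_end):
    """Temporal IoU between a predicted segment and the ground-truth segment."""
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return inter / union if union > 0 else 0.0


def grounding_reward(prediction: str, gt_start: float, gt_end: float) -> float:
    """Parse a model answer like '12.0 to 18.5' (assumed format) and score it by IoU.

    Malformed answers that contain no parseable interval receive zero reward,
    which also pressures the model toward well-formed outputs.
    """
    m = re.search(r"(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)", prediction)
    if not m:
        return 0.0  # no interval found: no reward
    start, end = float(m.group(1)), float(m.group(2))
    return temporal_iou(start, end, gt_start, gt_end)
```

With such a scalar reward, a policy-gradient method like PPO can optimize the model directly from query-segment pairs, which is why no step-level annotations are needed.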

📝 Abstract
We introduce TimeZero, a reasoning-guided LVLM designed for the temporal video grounding (TVG) task. This task requires precisely localizing relevant video segments within long videos based on a given language query. TimeZero tackles this challenge by extending the inference process, enabling the model to reason about video-language relationships solely through reinforcement learning. To evaluate the effectiveness of TimeZero, we conduct experiments on two benchmarks, where TimeZero achieves state-of-the-art performance on Charades-STA. Code is available at https://github.com/www-Ye/TimeZero.
Problem

Research questions and friction points this paper is trying to address.

Precisely localize video segments using language queries
Extend inference process for video-language reasoning
Achieve state-of-the-art performance on Charades-STA benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reasoning-guided LVLM for temporal video grounding
Reinforcement learning for video-language relationship reasoning
State-of-the-art performance on Charades-STA benchmark
Ye Wang
Renmin University of China
Boshen Xu
Renmin University of China
Zihao Yue
Renmin University of China
Zihan Xiao
Unknown affiliation
Ziheng Wang
Renmin University of China
Liang Zhang
Renmin University of China
Dingyi Yang
Renmin University of China
Wenxuan Wang
Renmin University of China, Hong Kong University of Science and Technology
Qin Jin
School of Information, Renmin University of China