TimeRefine: Temporal Grounding with Time Refining Video LLM

📅 2024-12-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In temporal video grounding, existing Video LLM approaches rely on single-step timestamp prediction, which yields imprecise boundary localization. To address this, we propose a temporal refinement paradigm that reformulates one-shot timestamp prediction as a multi-step iterative task: the model first performs coarse localization, then repeatedly refines the boundaries by predicting offsets to the target segment. The method requires no modification to the backbone architecture and is plug-and-play. Additionally, we introduce an auxiliary regression head whose loss grows with the distance between a predicted segment and the ground truth, sharpening the model's ability to discriminate temporally proximal proposals. Evaluated on ActivityNet and Charades-STA, our approach achieves absolute mIoU improvements of 3.6% and 5.0%, respectively, outperforming state-of-the-art Video LLM baselines.

📝 Abstract
Video temporal grounding aims to localize relevant temporal boundaries in a video given a textual prompt. Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps. However, accurately localizing timestamps in videos remains challenging for Video LLMs when relying solely on temporal token prediction. Our proposed TimeRefine addresses this challenge in two ways. First, instead of directly predicting the start and end timestamps, we reformulate the temporal grounding task as a temporal refining task: the model first makes rough predictions and then refines them by predicting offsets to the target segment. This refining process is repeated multiple times, through which the model progressively self-improves its temporal localization accuracy. Second, to enhance the model's temporal perception capabilities, we incorporate an auxiliary prediction head that penalizes the model more if a predicted segment deviates further from the ground truth, thus encouraging the model to make closer and more accurate predictions. Our plug-and-play method can be integrated into most LLM-based temporal grounding approaches. The experimental results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on the ActivityNet and Charades-STA datasets, respectively. Code and pretrained models will be released.
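The refinement loop described above (a rough prediction followed by repeated offset corrections) can be sketched as follows. This is a minimal illustration, not the authors' code: the `predict` callable, its signature, and the fixed number of rounds are all assumptions for the sake of the example.

```python
def refine_segment(predict, video, query, num_rounds=3):
    """Iteratively refine a temporal segment prediction.

    predict(video, query, current) is a hypothetical model interface:
    with current=None it returns a coarse (start, end) guess; otherwise
    it returns (d_start, d_end) offsets toward the target segment.
    """
    start, end = predict(video, query, None)           # coarse localization
    for _ in range(num_rounds):                        # repeated refinement
        d_start, d_end = predict(video, query, (start, end))
        start, end = start + d_start, end + d_end      # apply predicted offsets
    return start, end
```

Each pass conditions on the current estimate, so the model can progressively correct its own boundaries instead of committing to a single next-token prediction.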
Problem

Research questions and friction points this paper is trying to address.

Single-step timestamp prediction by Video LLMs yields imprecise temporal boundaries
Direct next-token prediction of timestamps offers no mechanism to self-correct rough localizations
Standard training penalizes near-miss and far-off predictions alike, limiting temporal perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates temporal grounding as an iterative refinement task
Adds an auxiliary prediction head that penalizes far-off predictions more heavily
Plug-and-play integration with most LLM-based grounding approaches
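The auxiliary head's loss can be sketched as below. The paper only states that predictions deviating further from the ground truth are penalized more; the specific combination of a linear and a quadratic term (and the `alpha` weight) is an assumption for illustration, not the paper's formulation.

```python
def distance_weighted_loss(pred, target, alpha=1.0):
    """Toy auxiliary regression loss for (start, end) segment pairs.

    pred, target: lists of (start, end) tuples.
    alpha: assumed weight on the quadratic term (not from the paper);
    the quadratic term makes large boundary errors cost disproportionately
    more than small ones, encouraging closer predictions.
    """
    total, n = 0.0, 0
    for (ps, pe), (ts, te) in zip(pred, target):
        for d in (abs(ps - ts), abs(pe - te)):
            total += d + alpha * d * d   # penalty grows superlinearly with distance
            n += 1
    return total / n
```

A prediction one second off each boundary incurs a far smaller loss than one five seconds off, which is the discriminative signal the auxiliary head is meant to provide.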