🤖 AI Summary
This paper addresses the cross-task challenge of precise timestamp localization in untrimmed long videos, encompassing temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. The authors propose TimeLoc, an end-to-end trainable, one-stage framework that unifies these tasks. Key contributions: (1) a single multi-task architecture that supports text queries as input and multiple actions as output; (2) a temporal chunking mechanism enabling efficient end-to-end training on ultra-long videos (>30k frames); and (3) a multi-stage text-encoder fine-tuning strategy that strengthens text-video semantic alignment. The model jointly optimizes contrastive and localization losses without task-specific adaptation. Evaluated on six benchmarks, including THUMOS14 and EPIC-Kitchens-100, it sets new state-of-the-art results across all tasks, e.g., +2.94% mAP on QVHighlights and +11.5% R1@0.5 on TACoS.
📝 Abstract
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at https://github.com/sming256/TimeLoc.
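The temporal chunking idea described above can be illustrated with a minimal sketch. All names below (`encode_chunk`, `chunked_encode`, the chunk size of 512) are hypothetical for illustration and are not taken from the TimeLoc codebase; the point is only the pattern: a long frame sequence is split into fixed-size chunks, each chunk is passed through the video encoder independently so that peak memory is bounded by the chunk size rather than the video length, and the per-chunk features are concatenated back into one temporal sequence for the localization head.

```python
def encode_chunk(frames):
    """Placeholder for a video-encoder forward pass on one chunk.

    For illustration each frame is 'embedded' as its negation; a real
    encoder would return a feature vector per frame.
    """
    return [-f for f in frames]


def chunked_encode(frames, chunk_size=512):
    """Encode an arbitrarily long frame sequence chunk by chunk.

    Memory use per encoder call is bounded by chunk_size, so a video
    with tens of thousands of frames can be processed end to end.
    """
    features = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        features.extend(encode_chunk(chunk))
    return features


# A 30k-frame video is processed as ceil(30000 / 512) = 59 chunks,
# yielding one feature per input frame.
video = list(range(30000))
feats = chunked_encode(video, chunk_size=512)
assert len(feats) == len(video)
```

In a training setting, the same pattern is typically combined with techniques such as gradient checkpointing so that backpropagation through the encoder also stays within the per-chunk memory budget.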