DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces challenges including coarse-grained temporal modeling and weak long-range dependency capture; conventional uniform frame sampling and implicit positional encoding fail to support precise temporal reasoning and critical event localization. To address this, we propose a dynamic absolute time enhancement framework: (1) a continuous timestamp injection mechanism establishes an explicit temporal reference system; (2) video sampling is reformulated as a vision-language retrieval task, jointly optimizing semantic relevance and temporal coverage; and (3) timestamp token embedding, descriptive caption generation, and a similarity-driven two-stage greedy sampling strategy are introduced. Evaluated on hour-scale long-video benchmarks, the method significantly improves absolute temporal understanding and event localization accuracy. Remarkably, the 7B-parameter model surpasses many 72B models on several benchmarks, setting new state-of-the-art results.

📝 Abstract
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE), which enhances temporal awareness in MLLMs through a Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the visual features, and sampling key events with a similarity-driven, temporally regularized greedy strategy. Our method achieves remarkable improvements in absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Notably, our 7B model even exceeds many 72B models on some benchmarks.
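The interleaving idea behind the Timestamp Injection Mechanism can be sketched minimally as follows. The function name, the textual format of the timestamp tokens, and the use of plain strings in place of real token/frame embeddings are all illustrative assumptions; the paper's actual tokenization and embedding details may differ.

```python
# Minimal sketch of timestamp injection: interleave per-frame features
# with textual timestamp tokens so the model sees an explicit, absolute
# temporal reference for every frame. Names and token format are assumed.

def inject_timestamps(frame_embeddings, timestamps_sec):
    """Return an interleaved sequence [ts_0, f_0, ts_1, f_1, ...].

    frame_embeddings: list of opaque per-frame features (strings here)
    timestamps_sec:   absolute time of each frame in seconds
    """
    assert len(frame_embeddings) == len(timestamps_sec)
    sequence = []
    for feat, t in zip(frame_embeddings, timestamps_sec):
        # A textual timestamp token precedes each frame embedding,
        # giving the model a continuous absolute-time reference system.
        sequence.append(f"<time: {t:.1f}s>")
        sequence.append(feat)
    return sequence

seq = inject_timestamps(["f0", "f1", "f2"], [0.0, 30.0, 60.0])
# → ["<time: 0.0s>", "f0", "<time: 30.0s>", "f1", "<time: 60.0s>", "f2"]
```

Because the timestamps are injected as text rather than encoded implicitly by position, frames sampled at irregular intervals still carry their true absolute times.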
Problem

Research questions and friction points this paper is trying to address.

Enhancing temporal awareness in long video understanding
Addressing long-range dependency issues in video analysis
Improving event localization and temporal reasoning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Timestamp Injection Mechanism for temporal awareness
Temporally regularized greedy sampling strategy
Vision-language retrieval reformulation for video sampling
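The temporally regularized greedy sampling named above can be sketched as a greedy loop over query-frame similarity scores with a minimum-gap constraint. The scoring rule and the hard-gap form of the regularizer are assumptions for illustration only, not the paper's exact algorithm.

```python
# Hedged sketch of similarity-driven, temporally regularized greedy
# sampling: pick high-similarity frames, but skip any frame that falls
# within `min_gap` seconds of an already selected one, so the final
# sample covers the timeline instead of clustering on one event.

def greedy_sample(similarities, times, k, min_gap):
    """Greedily select up to k frame indices by descending similarity,
    subject to a minimum temporal gap between selected frames."""
    order = sorted(range(len(similarities)),
                   key=lambda i: similarities[i], reverse=True)
    chosen = []
    for i in order:
        if all(abs(times[i] - times[j]) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)

# Frames every 10 s; the query matches frames 1 and 2 strongly, but
# they are only 10 s apart, so the regularizer keeps frame 1 and fills
# temporal coverage with frame 4 instead.
sims  = [0.1, 0.9, 0.85, 0.2, 0.6]
times = [0, 10, 20, 30, 40]
print(greedy_sample(sims, times, k=2, min_gap=15))  # → [1, 4]
```

A pure similarity top-k would return frames 1 and 2 here; the gap constraint is what trades a little semantic relevance for temporal coverage.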