🤖 AI Summary
This work addresses the challenge that existing video large language models (VideoLLMs) struggle to precisely localize event boundaries in untrimmed long videos, leading to temporally inaccurate dense descriptions. To overcome this limitation, the authors propose a temporal anchoring mechanism that explicitly models the start and end timestamps of events, enhancing the model's temporal awareness. Coupled with a coherence-aware event sampling strategy, this approach enables effective cross-modal alignment and produces highly coherent descriptions. The method is the first to support accurate temporal understanding of an arbitrary number of events and achieves state-of-the-art performance across multiple benchmarks, significantly outperforming current models on dense video captioning, moment retrieval, and temporal question answering.
📄 Abstract
Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs still struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded in time. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs toward temporally aware video event understanding. During inference, to properly determine the output caption sequence from an arbitrary number of events present within a video, we introduce an event-coherent sampling strategy that selects event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting performs favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks, including moment retrieval and temporal QA.
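To make the event-coherent sampling idea concrete, the following is a minimal sketch of how candidate caption sequences could be scored by combining inter-event coherence with caption-to-video similarity. All function names, the embedding inputs, and the `alpha` weighting are illustrative assumptions; the paper's exact scoring formulation may differ.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def event_coherent_score(caption_embs, video_emb, alpha=0.5):
    """Score one candidate sequence of event captions.

    caption_embs: embeddings of temporally ordered event captions.
    video_emb:    embedding of the input video.
    alpha:        hypothetical trade-off between coherence and
                  cross-modal similarity (not from the paper).
    """
    # Coherence: average similarity between consecutive event captions.
    if len(caption_embs) > 1:
        coherence = np.mean([cosine(caption_embs[i], caption_embs[i + 1])
                             for i in range(len(caption_embs) - 1)])
    else:
        coherence = 1.0
    # Cross-modal term: average caption-to-video similarity.
    cross_modal = np.mean([cosine(c, video_emb) for c in caption_embs])
    return alpha * coherence + (1.0 - alpha) * cross_modal

def select_captions(candidate_seqs, video_emb):
    # Pick the candidate caption sequence with the highest combined score.
    return max(candidate_seqs, key=lambda seq: event_coherent_score(seq, video_emb))
```

A sequence whose captions both flow into one another and match the video content scores highest; an incoherent or off-topic sequence is rejected even if individual captions are fluent.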