Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dense video captioning methods rely on implicit modeling and struggle to capture temporal coherence among events and the global semantics of visual contexts. To address this, we propose a context-aware cross-modal interaction framework that explicitly models video temporal dynamics and text-semantic alignment. Specifically, we introduce a cross-modal frame aggregation module that retrieves and aggregates relevant frames to extract temporally coherent, event-aligned textual features, and a context-aware feature enhancement module that uses query-guided attention to integrate visual dynamics with pseudo-event semantics. Evaluated on ActivityNet Captions and YouCook2, our method improves the temporal consistency and semantic completeness of generated descriptions and achieves state-of-the-art performance, validating the effectiveness of explicitly and jointly modeling temporal structure and semantics in dense video captioning.
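The paper ships no code, so the following is only a minimal PyTorch sketch of what the cross-modal retrieval step inside Cross-modal Frame Aggregation could look like: each frame retrieves its most similar corpus sentences and pools them into an event-aligned textual feature. The function name, the top-k value, and the similarity-weighted pooling are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cross_modal_frame_aggregation(frame_feats, text_feats, top_k=5):
    """Hypothetical sketch: retrieve corpus sentences relevant to each frame
    and pool them into event-aligned textual features.

    frame_feats: (T, D) per-frame visual features (e.g. CLIP embeddings)
    text_feats:  (N, D) sentence embeddings from an external text corpus
    returns:     (T, D) one aggregated textual feature per frame
    """
    # Cosine similarity between every frame and every corpus sentence.
    sim = F.normalize(frame_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T  # (T, N)
    # Keep the top-k most similar sentences per frame.
    scores, idx = sim.topk(top_k, dim=-1)        # both (T, k)
    weights = scores.softmax(dim=-1)             # (T, k)
    # Similarity-weighted pooling of the retrieved sentence features.
    retrieved = text_feats[idx]                  # (T, k, D)
    return (weights.unsqueeze(-1) * retrieved).sum(dim=1)  # (T, D)
```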

📝 Abstract
Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and the comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from a text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
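The abstract names query-guided attention but gives no architectural detail. The module below is a speculative sketch of one standard way to realize it: learnable event queries cross-attend over frame features to pool salient events. The class name, dimensions, number of queries, and residual layout are all assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class QueryGuidedAttention(nn.Module):
    """Speculative sketch: learnable event queries pool salient frames."""

    def __init__(self, dim=512, num_queries=10, num_heads=8):
        super().__init__()
        # One learnable query per candidate event.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) frame-level visual features
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)    # (B, Q, D)
        # Each query attends over all frames and pools the relevant ones.
        out, _ = self.attn(q, frame_feats, frame_feats)    # (B, Q, D)
        return self.norm(out + q)                          # residual + norm
```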
Problem

Research questions and friction points this paper is trying to address.

Modeling temporal coherence across video event sequences
Capturing comprehensive semantics within visual contexts
Enhancing dense video captioning with an explicit temporal-semantic framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explicit temporal-semantic modeling framework for video captioning
Cross-modal frame aggregation extracts event-aligned textual features
Query-guided attention integrates visual dynamics with pseudo-event semantics (see the fusion sketch after this list)
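As a companion to the two sketches above, here is an equally hypothetical fusion block for the third listed contribution: event-level visual features cross-attend over the retrieved pseudo-event textual features before being handed to a captioning head. The norm placement and FFN width are assumptions, not the authors' configuration.

```python
import torch.nn as nn

class ContextAwareEnhancement(nn.Module):
    """Hypothetical sketch: fuse visual event features with pseudo-event semantics."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, event_feats, pseudo_text_feats):
        # event_feats:       (B, Q, D) query-guided visual event features
        # pseudo_text_feats: (B, S, D) event-aligned textual features from retrieval
        # Visual events attend over the retrieved pseudo-event semantics.
        ctx, _ = self.cross_attn(event_feats, pseudo_text_feats, pseudo_text_feats)
        x = self.norm1(event_feats + ctx)
        return self.norm2(x + self.ffn(x))  # enhanced features for the caption head
```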
👥 Authors
Mingda Jia
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Weiliang Meng
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Zenghuang Fu
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yiheng Li
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qi Zeng
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Yifan Zhang
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Ju Xin
Northern Navigation Service Center, Qingdao, China
Rongtao Xu
MBZUAI; previously CASIA, HUST
Research interests: Intelligent Robot, Embodied AI, VLA, VLM, Spatial-temporal AI
Jiguang Zhang
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences
Xiaopeng Zhang
MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences