🤖 AI Summary
Existing dense video captioning methods rely on implicit modeling and struggle to effectively capture temporal coherence among events and global visual semantics. To address this, we propose a context-aware cross-modal interaction framework that explicitly models video temporal dynamics and text-semantic alignment. Specifically, we introduce cross-modal frame aggregation for fine-grained vision-language alignment; a query-guided attention mechanism that enhances salient event features; and a context-aware feature enhancement module that incorporates pseudo-event semantics to construct event-aligned textual representations. Evaluated on ActivityNet Captions and YouCook2, our method significantly improves the temporal consistency and semantic completeness of generated descriptions, achieving state-of-the-art performance. The experimental results validate the effectiveness of explicitly and jointly modeling temporal structure and semantics in dense video captioning.
📝 Abstract
Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and the comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpora. Specifically, our model consists of two core components: Cross-modal Frame Aggregation, which aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement, which utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves state-of-the-art performance on the dense video captioning task.
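At its core, the query-guided attention mentioned above can be understood as cross-attention in which learned event queries attend over frame-level visual features to pool an event-aligned context. The sketch below is a minimal illustration of that general mechanism only; the function name, shapes, and the absence of learned projections are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def query_guided_attention(queries, frames):
    """Scaled dot-product cross-attention sketch (illustrative, not CACMI's code).

    queries: (n_queries, d) event queries
    frames:  (n_frames, d) frame-level visual features
    returns: (n_queries, d) query-attended visual context
    """
    d = queries.shape[-1]
    scores = queries @ frames.T / np.sqrt(d)       # (n_queries, n_frames)
    scores -= scores.max(axis=-1, keepdims=True)   # softmax numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # attention over frames
    return weights @ frames                        # weighted pooling of frames

# Toy usage: 3 event queries attend over 8 frame features of dimension 16.
rng = np.random.default_rng(0)
ctx = query_guided_attention(rng.normal(size=(3, 16)), rng.normal(size=(8, 16)))
print(ctx.shape)
```

In a full model, `queries` and `frames` would pass through learned linear projections and the pooled context would feed the downstream captioning head; the sketch keeps only the attention arithmetic.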