Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses dense video captioning under weak supervision, without event boundary annotations. To jointly model event localization and caption generation, we propose an implicit location-caption alignment paradigm between video segments and textual descriptions. Our method eliminates explicit event proposal generation and hard alignment constraints; instead, it introduces a differentiable complementary masking mechanism, comprising positive and negative soft masks, to implicitly infer event boundaries. These masks guide a dual-mode video captioning module, which is optimized end-to-end via a contrastive complementarity loss and a reconstruction consistency constraint. To our knowledge, this is the first framework that learns location-caption alignment implicitly, solely from unsegmented video-caption pairs and without any temporal annotations. Extensive experiments demonstrate substantial improvements over prior weakly supervised approaches, achieving performance competitive with fully supervised state-of-the-art methods on benchmarks including ActivityNet Captions.
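The core mechanism is compact enough to sketch in code. Below is a minimal PyTorch sketch, not the authors' implementation: a small scorer network (the `MaskGenerator` name and its architecture are illustrative assumptions) predicts a differentiable positive soft mask over frames, and the negative mask is its elementwise complement, so the two masked views partition the video.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Hypothetical mask generation module: frame features -> soft masks in (0, 1)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim)
        logits = self.scorer(frames).squeeze(-1)   # (batch, num_frames)
        pos_mask = torch.sigmoid(logits)           # differentiable "event" mask
        neg_mask = 1.0 - pos_mask                  # complement covers the rest
        return pos_mask, neg_mask

feats = torch.randn(2, 100, 512)                   # toy clip features (B, T, D)
pos, neg = MaskGenerator(512)(feats)
event_view = feats * pos.unsqueeze(-1)             # what the "event" caption sees
context_view = feats * neg.unsqueeze(-1)           # what the complementary caption sees
```

Because `pos_mask + neg_mask = 1` at every frame, the two views jointly preserve all of the video's evidence, which is what makes it possible to train their generated captions to complement each other.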

📝 Abstract
Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a significant challenge for accurately locating the temporal boundaries of events, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm based on complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module produces differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that the captions generated from positively and negatively masked videos are complementary, together forming a complete description of the video. In this way, even under weak supervision, event locations and event captions can be aligned implicitly. Extensive experiments on public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.
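To make the alignment idea concrete, here is one plausible wiring of the training objective under stated assumptions: `captioner` stands in for the dual-mode video captioning module and returns a scalar caption negative log-likelihood, the split into one event's caption and the remaining captions is an assumed pairing, and the hinge form of the contrastive term is illustrative rather than the paper's exact loss.

```python
import torch.nn.functional as F

def implicit_alignment_loss(captioner, mask_gen, frames, event_cap, rest_caps,
                            margin: float = 1.0):
    # frames: (B, T, D) unsegmented video features; event_cap / rest_caps are
    # token ids for one event's caption and the remaining captions.
    pos_mask, neg_mask = mask_gen(frames)              # complementary soft masks
    pos_view = frames * pos_mask.unsqueeze(-1)         # evidence for the event
    neg_view = frames * neg_mask.unsqueeze(-1)         # evidence for everything else

    # Hypothetical API: captioner(video, tokens) -> scalar NLL of tokens given video.
    nll_pos = captioner(pos_view, event_cap)           # event caption from positive view
    nll_rest = captioner(neg_view, rest_caps)          # remaining captions from negative view

    # Contrastive complementarity: the negative view must explain the event
    # caption *worse* than the positive view does, by at least `margin`.
    nll_swap = captioner(neg_view, event_cap)
    contrastive = F.relu(margin - (nll_swap - nll_pos))

    return nll_pos + nll_rest + contrastive
```

The intuition: minimizing `nll_pos` pulls the positive mask toward the frames that actually ground the event, the contrastive term rules out the degenerate solution in which both masks see everything, and the complementary construction guarantees the two captions together account for the complete video description.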
Problem

Research questions and friction points this paper is trying to address.

Video-Subtitle Alignment
Time Annotation
Educational Videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly Supervised Learning
Dense Video Captioning
Complementary Mask Alignment
👥 Authors
Shiping Ge
Independent Researcher
Multimodal Learning · Data Mining
Qiang Chen
Tencent WeChat, Guangzhou, China
Zhiwei Jiang
Nanjing University
Natural Language Processing
Yafeng Yin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Liu Qin
Tencent WeChat, Guangzhou, China
Ziyao Chen
Tencent WeChat, Guangzhou, China
Qing Gu
Nanjing University