Implicit Location-Caption Alignment via Complementary Masking for Weakly-Supervised Dense Video Captioning

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses dense video captioning under weak supervision, without event boundary annotations. To jointly model event localization and caption generation, we propose an implicit location-caption alignment paradigm between video segments and textual descriptions. Our method eliminates explicit event proposal generation and hard alignment constraints; instead, it introduces a differentiable complementary masking mechanism, comprising positive and negative soft masks, to implicitly infer event boundaries. These masks guide a dual-mode video captioning module, which is optimized end-to-end via a contrastive complementarity loss and a reconstruction consistency constraint. To our knowledge, this is the first framework that learns location-caption alignment implicitly, solely from unsegmented video-caption pairs and without any temporal annotations. Extensive experiments demonstrate substantial improvements over prior weakly supervised approaches, achieving performance competitive with fully supervised state-of-the-art methods on benchmarks including ActivityNet Captions.
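The core mechanism is compact enough to sketch in code. Below is a minimal PyTorch sketch, not the authors' implementation: a small scorer network (the `MaskGenerator` name and its architecture are illustrative assumptions) predicts a differentiable positive soft mask over frames, and the negative mask is its elementwise complement, so the two masked views partition the video.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Hypothetical mask generation module: frame features -> soft masks in (0, 1)."""
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames):
        # frames: (batch, num_frames, feat_dim)
        logits = self.scorer(frames).squeeze(-1)   # (batch, num_frames)
        pos_mask = torch.sigmoid(logits)           # differentiable "event" mask
        neg_mask = 1.0 - pos_mask                  # complement covers the rest
        return pos_mask, neg_mask

feats = torch.randn(2, 100, 512)                   # toy clip features (B, T, D)
pos, neg = MaskGenerator(512)(feats)
event_view = feats * pos.unsqueeze(-1)             # what the "event" caption sees
context_view = feats * neg.unsqueeze(-1)           # what the complementary caption sees
```

Because `pos_mask + neg_mask = 1` at every frame, the two views jointly preserve all of the video's evidence, which is what makes it possible to train their generated captions to complement each other.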

📝 Abstract
Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a significant challenge for accurately locating the temporal boundaries of events, as the relevant supervision is unavailable. Existing methods rely on explicit alignment constraints between event locations and captions, which involve complex event proposal procedures during both training and inference. To tackle this problem, we propose a novel implicit location-caption alignment paradigm based on complementary masking, which simplifies the complex event proposal and localization process while maintaining effectiveness. Specifically, our model comprises two components: a dual-mode video captioning module and a mask generation module. The dual-mode video captioning module captures global event information and generates descriptive captions, while the mask generation module produces differentiable positive and negative masks for localizing the events. These masks enable the implicit alignment of event locations and captions by ensuring that the captions generated from positively and negatively masked videos are complementary, together forming a complete description of the video. In this way, even under weak supervision, event locations and event captions can be aligned implicitly. Extensive experiments on public datasets demonstrate that our method outperforms existing weakly-supervised methods and achieves competitive results compared to fully-supervised methods.
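To make the alignment idea concrete, here is one plausible wiring of the training objective under stated assumptions: `captioner` stands in for the dual-mode video captioning module and returns a scalar caption negative log-likelihood, the split into one event's caption and the remaining captions is an assumed pairing, and the hinge form of the contrastive term is illustrative rather than the paper's exact loss.

```python
import torch.nn.functional as F

def implicit_alignment_loss(captioner, mask_gen, frames, event_cap, rest_caps,
                            margin: float = 1.0):
    # frames: (B, T, D) unsegmented video features; event_cap / rest_caps are
    # token ids for one event's caption and the remaining captions.
    pos_mask, neg_mask = mask_gen(frames)              # complementary soft masks
    pos_view = frames * pos_mask.unsqueeze(-1)         # evidence for the event
    neg_view = frames * neg_mask.unsqueeze(-1)         # evidence for everything else

    # Hypothetical API: captioner(video, tokens) -> scalar NLL of tokens given video.
    nll_pos = captioner(pos_view, event_cap)           # event caption from positive view
    nll_rest = captioner(neg_view, rest_caps)          # remaining captions from negative view

    # Contrastive complementarity: the negative view must explain the event
    # caption *worse* than the positive view does, by at least `margin`.
    nll_swap = captioner(neg_view, event_cap)
    contrastive = F.relu(margin - (nll_swap - nll_pos))

    return nll_pos + nll_rest + contrastive
```

The intuition: minimizing `nll_pos` pulls the positive mask toward the frames that actually ground the event, the contrastive term rules out the degenerate solution in which both masks see everything, and the complementary construction guarantees the two captions together account for the complete video description.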
Problem

Research questions and friction points this paper is trying to address.

Video-Subtitle Alignment
Time Annotation
Educational Videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly Supervised Learning
Dense Video Captioning
Complementary Mask Alignment
👥 Authors
Shiping Ge
Independent Researcher
Multimodal Learning · Data Mining
Qiang Chen
Tencent WeChat, Guangzhou, China
Zhiwei Jiang
Nanjing University
Natural Language Processing
Yafeng Yin
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Liu Qin
Tencent WeChat, Guangzhou, China
Ziyao Chen
Tencent WeChat, Guangzhou, China
Qing Gu
Nanjing University