SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of weakly supervised dense video captioning, where the absence of event-level temporal boundary annotations hinders accurate localization and description of multiple events. To overcome this, the authors propose a semantic-aware cross-modal alignment framework that employs a similarity-aware training objective to generate semantically coherent Gaussian masks. Additionally, they leverage a large language model to synthesize captions and design a cross-caption fusion mechanism to provide auxiliary alignment signals. This approach effectively mitigates the semantic ambiguity of masks and the sparsity of weak supervision inherent in prior methods. Experimental results demonstrate state-of-the-art performance on both the ActivityNet Captions and YouCook2 datasets, achieving significant improvements in both event localization accuracy and caption quality.

📝 Abstract
Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos using only caption annotations for training, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, the existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to the corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity-aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
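The core idea in the abstract — a Gaussian temporal mask scored by its alignment with an event caption — can be sketched as a minimal NumPy toy. This is not the paper's implementation: `gaussian_mask`, `similarity_aware_score`, and the synthetic features are all hypothetical illustrations of how a mask-weighted frame-to-caption similarity could favor the correct temporal region.

```python
import numpy as np

def gaussian_mask(num_frames, center, width):
    """Soft temporal mask: a Gaussian over normalized time [0, 1],
    centered at `center` with standard deviation `width`."""
    t = np.linspace(0.0, 1.0, num_frames)
    return np.exp(-0.5 * ((t - center) / width) ** 2)

def similarity_aware_score(frame_feats, caption_feat, center, width):
    """Mask-weighted cosine similarity between per-frame features and an
    event-caption embedding. A similarity-aware objective would adjust
    the mask parameters to maximize a score of this form."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    c = caption_feat / np.linalg.norm(caption_feat)
    sims = f @ c                       # per-frame cosine similarity
    mask = gaussian_mask(len(sims), center, width)
    return (mask * sims).sum() / mask.sum()

# Toy data: the "event" spans frames 25-35 (t ~ 0.25-0.35), where frame
# features are aligned with the caption embedding; elsewhere they are noise.
rng = np.random.default_rng(0)
cap = rng.normal(size=16)
feats = rng.normal(size=(100, 16)) * 0.5
feats[25:36] += cap

# A mask centered on the event scores higher than one centered off-event.
on_event = similarity_aware_score(feats, cap, center=0.3, width=0.1)
off_event = similarity_aware_score(feats, cap, center=0.8, width=0.1)
print(on_event > off_event)            # True
```

In the paper's setting, the mask parameters would be predicted per event and trained jointly with the captioning objective; the toy above only illustrates why caption similarity provides a useful localization signal.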
Problem

Research questions and friction points this paper is trying to address.

Weakly-Supervised Dense Video Captioning
semantic alignment
mask generation
caption sparsity
temporal localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Similarity-Aware Guidance
Inter-Caption Augmentation
Weakly-Supervised Dense Video Captioning
Cross-Modal Alignment
LLM-based Synthetic Captioning
Ye-Chan Kim
Hanyang University, South Korea
SeungJu Cha
Hanyang University, South Korea
Si-Woo Kim
Hanyang University, South Korea
Minju Jeon
Hanyang University, South Korea
Hyungee Kim
Hanyang University, South Korea
Dong-Jin Kim
Assistant Professor, Hanyang University
Computer Vision · Machine Learning · Natural Language Processing · Artificial Intelligence