🤖 AI Summary
Dense video captioning faces two key challenges: uneven frame importance, which existing methods commonly overlook, and inflexible fixed-length segment retrieval, which adapts poorly to scene transitions. To address these, we propose a saliency-aware joint optimization framework. Our method introduces (1) a timestamp-supervised, sigmoid-based frame reweighting mechanism that dynamically emphasizes semantically critical frames, and (2) a semantic-similarity-driven adaptive video segmentation strategy that precisely captures scene boundaries. These components are jointly optimized in an end-to-end model to simultaneously improve event localization and caption generation. Extensive experiments on YouCook2 and ViTT demonstrate state-of-the-art performance, validating the effectiveness of integrating frame-level saliency modeling with semantic-adaptive segment retrieval.
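To make the first component concrete, here is a minimal sketch of how timestamp annotations could be converted into soft per-frame importance weights using sigmoids. The function name `frame_weights`, the product-of-sigmoids formulation, and the sharpness parameter `k` are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_weights(num_frames, events, k=5.0):
    """Convert event timestamps into per-frame importance weights.

    Each annotated event (start, end) contributes a soft plateau of
    weight near 1 inside the interval, falling off smoothly at the
    boundaries via two opposing sigmoids. `k` controls how sharp the
    transition is. Hypothetical formulation for illustration only.
    """
    t = np.arange(num_frames, dtype=float)
    w = np.zeros(num_frames)
    for start, end in events:
        # Rising sigmoid at the event start, falling sigmoid at the end;
        # take the max so overlapping events reinforce rather than cancel.
        w = np.maximum(w, sigmoid(k * (t - start)) * sigmoid(k * (end - t)))
    return w
```

Frames inside an annotated event receive weights close to 1, while frames far from any event are downweighted toward 0, instead of all frames being treated equally.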
📝 Abstract
Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
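The second component, segmenting a video by frame similarity rather than fixed-size chunks, can be sketched as follows. This is an assumed implementation: the cosine-similarity criterion, the function name `adaptive_segments`, and the `threshold` value are illustrative choices, not necessarily those used in Sali4Vid:

```python
import numpy as np

def adaptive_segments(features, threshold=0.8):
    """Split a video into variable-length segments at scene transitions.

    `features`: (num_frames, dim) array of per-frame embeddings, e.g.
    from a pretrained visual encoder. A segment boundary is placed
    wherever the cosine similarity between consecutive frames drops
    below `threshold` (an assumed cut-off). Returns (start, end) index
    pairs, with `end` exclusive.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Cosine similarity between each frame and its successor.
    sims = (normed[:-1] * normed[1:]).sum(axis=1)
    boundaries = np.flatnonzero(sims < threshold) + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(features)]))
    return list(zip(starts.tolist(), ends.tolist()))
```

Unlike fixed-size chunking, the resulting segments align with points where frame content changes, which is the property the retrieval step exploits.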