Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

📅 2025-09-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Dense video captioning faces two key challenges: uneven frame importance—commonly overlooked by existing methods—and inflexible fixed-length segment retrieval, which poorly adapts to scene transitions. To address these, we propose a saliency-aware joint optimization framework. Our method introduces (1) a timestamp-supervised sigmoid-based frame reweighting mechanism that dynamically emphasizes semantically critical frames, and (2) a semantic-similarity-driven adaptive video segmentation strategy that precisely captures scene boundaries. These components are jointly optimized in an end-to-end model to simultaneously improve event localization and caption generation. Extensive experiments on YouCook2 and ViTT demonstrate state-of-the-art performance, validating the effectiveness of integrating frame-level saliency modeling with semantic-adaptive segment retrieval.
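The timestamp-supervised reweighting idea above can be sketched in code. The paper's exact formula is not reproduced here; the snippet below assumes a plausible form in which each annotated event `(start, end)` contributes a soft window built from two sigmoids, and a frame takes the maximum weight over all events. The function name `frame_weights` and the `sharpness` parameter are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_weights(frame_times, events, sharpness=2.0):
    """Soft frame-importance weights derived from event timestamps.

    A hedged sketch of saliency-aware reweighting: each (start, end)
    annotation yields a soft window sigmoid(k*(t-start)) * sigmoid(k*(end-t)),
    so frames inside an event approach weight 1 and frames far outside
    approach 0. Frames covered by several events keep the largest weight.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    weights = np.zeros_like(frame_times)
    for start, end in events:
        w = sigmoid(sharpness * (frame_times - start)) * \
            sigmoid(sharpness * (end - frame_times))
        weights = np.maximum(weights, w)
    return weights

# One frame per second over a 20 s clip, two annotated events.
t = np.arange(0, 20, 1.0)
w = frame_weights(t, [(3.0, 8.0), (12.0, 16.0)])
```

Frames inside either event receive weights near 1, while frames in the gap between events are strongly down-weighted, which is the intended contrast with treating all frames equally.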

📝 Abstract
Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
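The adaptive segmentation described in the abstract — splitting a video by frame similarity rather than into fixed-size chunks — can be illustrated as follows. This is a minimal sketch, not the paper's implementation: it assumes per-frame feature vectors and starts a new segment wherever the cosine similarity of consecutive frames drops below a threshold. The `threshold` value and the function name are assumptions for illustration.

```python
import numpy as np

def segment_by_similarity(features, threshold=0.85):
    """Split a frame-feature sequence at likely scene transitions.

    Sketch of semantic-based segmentation: compute cosine similarity
    between each pair of adjacent frame features and place a segment
    boundary wherever similarity falls below `threshold`. Returns
    half-open (start, end) frame-index ranges covering the video.
    """
    feats = np.asarray(features, dtype=float)
    norm = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = np.sum(norm[:-1] * norm[1:], axis=1)  # cosine of adjacent frames
    boundaries = np.where(sims < threshold)[0] + 1
    segments, start = [], 0
    for b in boundaries:
        segments.append((start, int(b)))
        start = int(b)
    segments.append((start, len(feats)))
    return segments

# Toy features: five frames of one "scene", then five of another.
feats = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
print(segment_by_similarity(feats))  # → [(0, 5), (5, 10)]
```

Unlike fixed-size chunking, the segment lengths here follow the content: a long static scene stays one segment, and a cut produces a boundary exactly where the features change.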
Problem

Research questions and friction points this paper is trying to address.

Addressing equal frame treatment in video captioning
Improving caption retrieval over fixed-size chunks
Capturing scene transitions for better event localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Saliency-aware Video Reweighting with sigmoid weights
Semantic-based Adaptive Caption Retrieval via segmentation
Frame similarity segmentation captures scene transitions
MinJu Jeon
Hanyang University, South Korea
Si-Woo Kim
Hanyang University, South Korea
Ye-Chan Kim
Hanyang University, South Korea
HyunGee Kim
Hanyang University, South Korea
Dong-Jin Kim
Assistant Professor, Hanyang University
Computer Vision · Machine Learning · Natural Language Processing · Artificial Intelligence