🤖 AI Summary
This work addresses the challenges of multi-task interference and temporal redundancy in dense video captioning, where event localization and text generation often share the same query representations. To mitigate these issues, the authors propose a role-specific query mechanism that decouples localization and description tasks. They further introduce a contrastive alignment objective to enforce semantic consistency between visual and textual modalities and design an overlap suppression loss to minimize redundant temporal predictions. Additionally, a lightweight concept-level semantic enhancement module is incorporated to enrich the generated captions. Evaluated on YouCook2 and ActivityNet Captions, the proposed end-to-end framework significantly outperforms existing methods, achieving more precise non-overlapping event localization and producing semantically richer natural language descriptions.
📝 Abstract
Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing them in natural language. While query-based frameworks enable simultaneous, end-to-end processing of localization and captioning, their reliance on shared queries often leads to significant interference between the two tasks, as well as temporal redundancy in localization. In this paper, we propose role-specific queries that separate localization and captioning into independent components, allowing each to learn its role exclusively. We then employ contrastive alignment to enforce semantic consistency between the corresponding outputs, ensuring coherent behavior across the separated queries. Furthermore, we design a novel suppression mechanism that penalizes mutual temporal overlaps across queries to tackle temporal redundancy, supervising the model to learn distinct, non-overlapping event regions for more precise localization. Additionally, we introduce a lightweight module that captures core event concepts to further enhance the semantic richness of captions through concept-level representations. We demonstrate the effectiveness of our method through extensive experiments on the major DVC benchmarks YouCook2 and ActivityNet Captions.
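To make the suppression idea concrete, the sketch below computes a mean pairwise temporal IoU (tIoU) over the event segments predicted by different queries, which can serve as an overlap penalty: it is zero when all segments are disjoint and grows as predictions overlap. This is a minimal, hypothetical illustration of such a loss, not the paper's exact formulation; the function names and the choice of mean pairwise tIoU are assumptions.

```python
def temporal_iou(a, b):
    """tIoU between two (start, end) intervals on the video timeline."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def overlap_suppression_loss(segments):
    """Mean pairwise tIoU over all distinct query pairs (lower is better).

    Hypothetical stand-in for an overlap suppression objective:
    minimizing it pushes queries toward non-overlapping event regions.
    `segments` is a list of (start, end) predictions, one per query.
    """
    n = len(segments)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    return sum(temporal_iou(segments[i], segments[j]) for i, j in pairs) / len(pairs)
```

For example, two disjoint segments `[(0.0, 0.5), (0.5, 1.0)]` incur zero loss, while two identical segments `[(0.0, 1.0), (0.0, 1.0)]` incur the maximum loss of 1.0. In practice such a term would be computed on the model's differentiable segment outputs and added to the localization and captioning losses with a weighting coefficient.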