🤖 AI Summary
To address the semantic discontinuity and incoherent descriptions that arise from purely frame-level feature modeling in audio description for long-form video, this paper proposes DANTE-AD, a multi-granularity vision-language modeling framework aimed at visually impaired audiences. The method introduces: (1) a dual visual attention mechanism that jointly captures frame-level detail and scene-level semantics; (2) a sequential cross-attention module enabling fine-grained inter-frame alignment and long-range narrative modeling; and (3) a Transformer-based dual-path visual encoder with multi-granularity alignment, complemented by an LLM-based evaluation framework. Evaluated on key scenes from well-known movies, the approach outperforms existing methods on BLEU, CIDEr, and LLM-based consistency metrics, with clear gains in caption coherence, contextual consistency, and perceptual fidelity.
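The paper does not spell out its architecture here, but the dual-path idea can be illustrated with a minimal PyTorch sketch: one attention branch operates over per-frame features, while a second branch operates over temporally pooled features that stand in for scene-level context. The class name `DualVisionEncoder`, the pooling window, and all dimensions are assumptions for illustration, not DANTE-AD's actual implementation.

```python
# Minimal sketch of a dual-path visual encoder (illustrative, not the paper's code).
import torch
import torch.nn as nn

class DualVisionEncoder(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Frame path: self-attention over individual frame embeddings (fine detail).
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Scene path: self-attention over temporally pooled segments (coarse semantics).
        self.scene_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scene_pool = nn.AvgPool1d(kernel_size=8, stride=8)  # assumed pooling window

    def forward(self, frames: torch.Tensor):
        # frames: (batch, num_frames, dim) pre-extracted visual features
        f, _ = self.frame_attn(frames, frames, frames)            # frame-level tokens
        pooled = self.scene_pool(frames.transpose(1, 2)).transpose(1, 2)
        s, _ = self.scene_attn(pooled, pooled, pooled)            # scene-level tokens
        return f, s

encoder = DualVisionEncoder()
frame_feats = torch.randn(2, 64, 512)     # e.g. 64 frames from a long clip
frame_emb, scene_emb = encoder(frame_feats)
print(frame_emb.shape, scene_emb.shape)   # (2, 64, 512), (2, 8, 512)
```

The two token streams would then be fused before decoding, as sketched after the abstract below.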
📝 Abstract
Audio Description is a narrated commentary designed to help vision-impaired audiences perceive key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an open problem. Existing methods rely solely on frame-level embeddings, describing object-based content well but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model that addresses this gap with a dual-vision Transformer-based architecture. DANTE-AD sequentially fuses frame- and scene-level embeddings to improve long-term contextual understanding, and we propose a novel sequential cross-attention method that provides contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods on traditional NLP metrics and LLM-based evaluations.
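To make the fusion step concrete, here is a hedged PyTorch sketch of one sequential cross-attention step in the spirit of the abstract: frame-level tokens query scene-level context and are residually updated, grounding each frame in long-range semantics before a captioning decoder. The class name `SequentialFusion` and all shapes are illustrative assumptions; the paper's actual fusion schedule may differ.

```python
# Sketch of cross-attention fusion of frame and scene embeddings (assumed design).
import torch
import torch.nn as nn

class SequentialFusion(nn.Module):
    """One fusion step: frame tokens attend to scene-level context."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_emb: torch.Tensor, scene_emb: torch.Tensor):
        # Queries are frame embeddings; keys/values are scene embeddings,
        # so each frame token is grounded in long-range scene semantics.
        ctx, _ = self.cross_attn(frame_emb, scene_emb, scene_emb)
        return self.norm(frame_emb + ctx)  # residual fusion

fusion = SequentialFusion()
frame_emb = torch.randn(2, 64, 512)   # frame-level tokens
scene_emb = torch.randn(2, 8, 512)    # scene-level context tokens
fused = fusion(frame_emb, scene_emb)
print(fused.shape)  # torch.Size([2, 64, 512]) — conditioning for a text decoder
```

Applying such a step repeatedly across a long clip is one plausible reading of "sequential" fusion; the residual connection keeps frame-level detail intact while injecting scene context.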