DANTE-AD: Dual-Vision Attention Network for Long-Term Audio Description

📅 2025-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the semantic discontinuity and incoherent descriptions that arise from purely frame-level feature modeling in long-video audio description, this paper proposes a multi-granularity vision-language modeling framework tailored for visually impaired users. The method introduces: (1) a dual visual attention mechanism that jointly captures frame-level details and scene-level semantics; (2) a temporal cross-attention module enabling fine-grained inter-frame alignment and long-range narrative modeling; and (3) a Transformer-based dual-path visual encoder integrated with multi-granularity alignment strategies, complemented by an LLM-enhanced evaluation framework. Evaluated on key scenes from well-known movie clips, the approach achieves state-of-the-art performance on BLEU, CIDEr, and LLM-based consistency metrics, demonstrating clear improvements in caption coherence, contextual consistency, and perceptual fidelity.
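The core fusion step described above — frame-level tokens attending to scene-level tokens via cross-attention before description generation — can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: the function names, dimensions, and the residual-add fusion are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # Scaled dot-product cross-attention: each query token attends
    # over all key/value tokens from the other stream.
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values

def dual_vision_fuse(frame_emb, scene_emb):
    # Hypothetical fusion step: frame-level tokens (queries) attend to
    # scene-level tokens (keys/values); the attended scene context is
    # added back to the frame stream as a residual.
    d = frame_emb.shape[-1]
    context = cross_attention(frame_emb, scene_emb, d)
    return frame_emb + context

rng = np.random.default_rng(0)
frames = rng.standard_normal((16, 64))  # 16 frame-level tokens, dim 64
scenes = rng.standard_normal((4, 64))   # 4 scene-level tokens, dim 64
fused = dual_vision_fuse(frames, scenes)
print(fused.shape)  # (16, 64)
```

The fused tokens keep the frame-level sequence length, so a downstream language decoder can consume them exactly as it would plain frame embeddings, now enriched with scene-level context.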

📝 Abstract
Audio Description is a narrated commentary designed to aid vision-impaired audiences in perceiving key visual elements in a video. While short-form video understanding has advanced rapidly, maintaining coherent long-term visual storytelling remains an open challenge. Existing methods rely solely on frame-level embeddings, effectively describing object-based content but lacking contextual information across scenes. We introduce DANTE-AD, an enhanced video description model leveraging a dual-vision Transformer-based architecture to address this gap. DANTE-AD sequentially fuses both frame- and scene-level embeddings to improve long-term contextual understanding. We propose a novel, state-of-the-art method for sequential cross-attention to achieve contextual grounding for fine-grained audio description generation. Evaluated on a broad range of key scenes from well-known movie clips, DANTE-AD outperforms existing methods across traditional NLP metrics and LLM-based evaluations.
Problem

Research questions and friction points this paper is trying to address.

Long-term coherent visual storytelling for audio description
Lack of contextual information across video scenes
Fine-grained audio description generation with contextual grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-vision Transformer-based architecture for video description
Sequentially fuses frame-level and scene-level embeddings
Novel sequential cross-attention for contextual grounding