🤖 AI Summary
Existing audio description (AD) methods typically generate descriptions for each video segment independently, resulting in poor sequential coherence, repetitive content, and a fragmented narrative for visually impaired users. To address this, we propose CoherentAD, a training-free framework that explicitly models cross-segment semantic consistency by first generating candidate descriptions and then applying an autoregressive sequence-selection mechanism. We further introduce novel sequence-level evaluation metrics, StoryRecall and RepetitionSuppression, to quantify narrative fidelity and redundancy reduction, respectively. Experiments demonstrate that CoherentAD significantly improves narrative coherence and linguistic conciseness while preserving information completeness, outperforming independent-segment baselines, and that it provides stronger accessibility support across multiple video understanding tasks.
📝 Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
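The two-stage pipeline described above (per-interval candidate generation, then auto-regressive selection across the sequence) can be sketched as a greedy loop that conditions each choice on the descriptions already selected. This is a minimal illustration, not the paper's implementation: the `toy_score` function below is a hypothetical stand-in for whatever coherence scorer the method actually uses.

```python
from typing import Callable, List

def select_coherent_sequence(
    candidates: List[List[str]],          # one candidate list per AD interval
    score: Callable[[List[str], str], float],
) -> List[str]:
    """Greedily pick one candidate per interval, conditioning each
    choice on the previously selected descriptions (auto-regressive)."""
    selected: List[str] = []
    for interval_candidates in candidates:
        best = max(interval_candidates, key=lambda c: score(selected, c))
        selected.append(best)
    return selected

def toy_score(context: List[str], candidate: str) -> float:
    """Hypothetical scorer: penalise word overlap with the previous
    description to discourage repetition across consecutive ADs."""
    if not context:
        return 0.0
    prev = set(context[-1].lower().split())
    cur = set(candidate.lower().split())
    return -float(len(prev & cur))

cands = [
    ["A man walks into a bar.", "A man enters."],
    ["The man orders a drink.", "He orders a drink."],
]
result = select_coherent_sequence(cands, toy_score)
# The second interval's pronoun variant is chosen because it
# repeats fewer words from the first description.
```

A real instantiation would replace `toy_score` with a model-based judgment of narrative coherence over the full selected prefix, but the selection loop itself stays the same.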