🤖 AI Summary
Existing audio description (AD) methods typically generate descriptions for each video segment independently, resulting in poor sequential coherence, repetitive content, and a fragmented narrative for visually impaired users. To address this, we propose CoherentAD, a training-free framework that explicitly models cross-segment semantic consistency by first generating candidate descriptions and then applying an autoregressive sequence-selection mechanism. We further introduce novel sequence-level evaluation metrics, StoryRecall and RepetitionSuppression, to quantify narrative fidelity and redundancy reduction, respectively. Experiments demonstrate that CoherentAD significantly improves narrative coherence and linguistic conciseness while preserving information completeness, outperforming independent-segment baselines, and that it provides stronger accessibility support across multiple video understanding tasks.
📝 Abstract
Audio Descriptions (ADs) convey essential on-screen information, allowing visually impaired audiences to follow videos. To be effective, ADs must form a coherent sequence that helps listeners to visualise the unfolding scene, rather than describing isolated moments. However, most automatic methods generate each AD independently, often resulting in repetitive, incoherent descriptions. To address this, we propose a training-free method, CoherentAD, that first generates multiple candidate descriptions for each AD time interval, and then performs auto-regressive selection across the sequence to form a coherent and informative narrative. To evaluate AD sequences holistically, we introduce a sequence-level metric, StoryRecall, which measures how well the predicted ADs convey the ground truth narrative, alongside repetition metrics that capture the redundancy across consecutive AD outputs. Our method produces coherent AD sequences with enhanced narrative understanding, outperforming prior approaches that rely on independent generations.
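The two-stage pipeline described above (per-interval candidate generation, then auto-regressive selection across the sequence) can be sketched as a greedy loop that conditions each choice on the descriptions already selected. This is a minimal illustration, not the paper's implementation: the `toy_score` function below is a hypothetical stand-in for whatever coherence scorer the method actually uses.

```python
from typing import Callable, List

def select_coherent_sequence(
    candidates: List[List[str]],          # one candidate list per AD interval
    score: Callable[[List[str], str], float],
) -> List[str]:
    """Greedily pick one candidate per interval, conditioning each
    choice on the previously selected descriptions (auto-regressive)."""
    selected: List[str] = []
    for interval_candidates in candidates:
        best = max(interval_candidates, key=lambda c: score(selected, c))
        selected.append(best)
    return selected

def toy_score(context: List[str], candidate: str) -> float:
    """Hypothetical scorer: penalise word overlap with the previous
    description to discourage repetition across consecutive ADs."""
    if not context:
        return 0.0
    prev = set(context[-1].lower().split())
    cur = set(candidate.lower().split())
    return -float(len(prev & cur))

cands = [
    ["A man walks into a bar.", "A man enters."],
    ["The man orders a drink.", "He orders a drink."],
]
result = select_coherent_sequence(cands, toy_score)
# The second interval's pronoun variant is chosen because it
# repeats fewer words from the first description.
```

A real instantiation would replace `toy_score` with a model-based judgment of narrative coherence over the full selected prefix, but the selection loop itself stays the same.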