🤖 AI Summary
This study addresses two limitations of existing audio description (AD) evaluation, namely its confinement to short clips and its neglect of the real-world viewing needs of blind and low-vision (BLV) users, by proposing ADQA, the first user-centered AD evaluation framework for multi-minute coherent videos. Methodologically, the authors construct a dataset of two aligned, independently authored AD tracks and design question-answering tasks that explicitly distinguish visual fact recognition from narrative reasoning, while quantifying AD subjectivity through human annotation and comparative assessment. Key contributions include: (1) the first systematic decoupling of visual perception and narrative comprehension in AD evaluation; (2) the first long-form, QA-driven AD benchmark with a public leaderboard; and (3) empirical evidence that current automatic AD systems substantially underperform human-authored descriptions on both question types. This work advances AD evaluation from clip-level to narrative-level assessment and shifts the paradigm from technical metrics toward user-centered cognitive outcomes.
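To make the setup concrete, here is a minimal sketch of how a multiple-choice, ADQA-style benchmark item might be represented, with the VA/NU distinction carried as a field. All names, fields, and example content below are illustrative assumptions, not the benchmark's actual schema or data.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ADQAItem:
    """One hypothetical multiple-choice question tied to a multi-minute video segment."""
    segment_id: str     # which coherent segment the question is about
    question: str       # should be answerable from a good AD track alone
    choices: List[str]  # multiple-choice options
    answer_idx: int     # index of the correct choice
    kind: str           # "VA" (visual appreciation) or "NU" (narrative understanding)

# Illustrative items: a VA question probes a visual fact,
# an NU question probes understanding of the plot.
va = ADQAItem("movie01_seg03", "What color is the letter the woman burns?",
              ["red", "white", "blue", "yellow"], answer_idx=1, kind="VA")
nu = ADQAItem("movie01_seg03", "Why does she burn the letter?",
              ["to hide evidence", "to stay warm", "by accident", "as a ritual"],
              answer_idx=0, kind="NU")
```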
📝 Abstract
Audio descriptions (ADs) narrate important visual details in movies, enabling Blind and Low Vision (BLV) users to understand the narrative and appreciate visual details. Existing work on automatic AD generation mostly focuses on few-second trimmed clips and evaluates the generated descriptions by comparing them against a single ground-truth reference AD. However, writing ADs is inherently subjective. By aligning and analyzing two independent AD tracks for the same movies, we quantify this subjectivity in when and whether to describe, and in what and how to highlight, and thereby show that working with trimmed clips is inadequate. We propose ADQA, a QA benchmark that evaluates ADs at the level of few-minute-long, coherent video segments, testing whether they would help BLV users understand the story and appreciate visual details. ADQA features visual appreciation (VA) questions about visual facts and narrative understanding (NU) questions based on the plot. Through ADQA, we show that current AD generation methods lag far behind human-authored ADs. We conclude with several recommendations for future work and introduce a public leaderboard for benchmarking.
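As an illustration of QA-driven evaluation, the sketch below (continuing the hypothetical `ADQAItem` above) scores one AD track by letting an answerer see only the AD text, never the video, and reporting accuracy per question kind. The keyword-overlap answerer is a toy stand-in for the answering model; a real setup would use a stronger reader, and none of these function names come from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List

def overlap_answerer(ad_text: str, question: str, choices: List[str]) -> int:
    """Toy answerer: pick the choice sharing the most words with the AD text."""
    words = set(ad_text.lower().split())
    scores = [len(words & set(choice.lower().split())) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def evaluate(items: List["ADQAItem"], ads: Dict[str, str],
             answer_fn: Callable[[str, str, List[str]], int]) -> Dict[str, float]:
    """Accuracy per question kind (VA vs. NU) for one AD track."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        pred = answer_fn(ads[item.segment_id], item.question, item.choices)
        correct[item.kind] += int(pred == item.answer_idx)
        total[item.kind] += 1
    return {kind: correct[kind] / total[kind] for kind in total}

# Usage: run the same items once with a system's ADs and once with the
# human-authored track, then compare the two accuracy breakdowns.
system_ads = {"movie01_seg03": "A woman holds a white letter over a candle."}
print(evaluate([va, nu], system_ads, overlap_answerer))  # {'VA': 1.0, 'NU': 0.0}
```

Reporting VA and NU accuracy separately is what allows a gap to be attributed to missed visual facts versus a broken narrative thread, mirroring the comparison the abstract describes between automatic systems and human-authored ADs.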