How well can VLMs rate audio descriptions: A multi-dimensional quantitative assessment framework

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of systematic evaluation of overall quality in full-length video audio descriptions (AD), a critical gap that hinders effective support for blind and low-vision users. To bridge this gap, the authors propose the first multi-dimensional AD quality assessment framework grounded in professional accessibility guidelines and expert input. Leveraging Item Response Theory (IRT), they quantitatively compare how well both vision-language models (VLMs) and human raters align with expert judgments when evaluating full-length ADs. Results reveal that while VLM-generated scores exhibit strong agreement with expert assessments, their underlying reasoning processes are less reliable than those of human evaluators. The findings advocate for a hybrid evaluation system that combines VLM scalability with human judgment to enable scalable yet high-quality AD quality control.

📝 Abstract
Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow, utilizing Item Response Theory, to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning was less reliable and actionable than that of human respondents. These insights show the potential of hybrid evaluation systems that leverage VLMs alongside human oversight, offering a path toward scalable AD quality control.
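The workflow above uses Item Response Theory to place raters and items on a common scale. As a rough illustration of the idea (not the paper's actual model or data), the sketch below fits a simple Rasch model by gradient ascent: the probability that rater r matches the expert judgment on item i is modeled as sigmoid(theta[r] - b[i]), where theta is rater proficiency and b is item difficulty. All names and the toy data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, iters=500, lr=0.05):
    """Fit a Rasch model: P(rater r agrees with the expert on item i)
    = sigmoid(theta[r] - b[i]).  `responses` maps (rater, item) -> 0/1.
    Returns (theta, b): rater proficiencies and item difficulties."""
    raters = sorted({r for r, _ in responses})
    items = sorted({i for _, i in responses})
    theta = {r: 0.0 for r in raters}   # rater "proficiency"
    b = {i: 0.0 for i in items}        # item difficulty
    for _ in range(iters):
        g_t = {r: 0.0 for r in raters}
        g_b = {i: 0.0 for i in items}
        for (r, i), y in responses.items():
            p = sigmoid(theta[r] - b[i])
            g_t[r] += y - p            # log-likelihood gradient w.r.t. theta
            g_b[i] -= y - p            # and w.r.t. difficulty
        for r in raters:
            theta[r] += lr * g_t[r]
        for i in items:
            b[i] += lr * g_b[i]
        # Anchor the scale so item difficulties average to zero.
        mean_b = sum(b.values()) / len(b)
        for i in items:
            b[i] -= mean_b
    return theta, b

# Toy data: a "vlm" rater agrees with the expert on more items than
# two hypothetical human raters do.
data = {}
for i in range(8):
    data[("vlm", i)] = 1 if i < 7 else 0
    data[("human1", i)] = 1 if i < 5 else 0
    data[("human2", i)] = 1 if i < 3 else 0

theta, b = fit_rasch(data)
```

Under this model, a rater's estimated theta summarizes agreement with expert ground truth while accounting for which items were easy or hard, which is what lets IRT compare VLM and human raters on the same scale.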
Problem

Research questions and friction points this paper is trying to address.

audio description
quality assessment
vision-language models
accessibility
full-length video
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio description
vision-language models
multi-dimensional assessment
Item Response Theory
accessibility evaluation
Lana Do
Northeastern University
Gio Jung
San Francisco State University
Juvenal Francisco Barajas
San Francisco State University
Andrew Taylor Scott
USA
Shasta Ihorn
USA
Alexander Mario Blum
Stanford University
Vassilis Athitsos
Professor, Computer Science and Engineering Department, University of Texas at Arlington
Computer Vision, Machine Learning, Data Mining, Gesture Recognition, Sign Language Recognition
Ilmi Yoon
Northeastern University