How well can VLMs rate audio descriptions: A multi-dimensional quantitative assessment framework

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the lack of systematic evaluation of overall quality in full-length video audio descriptions (AD), a critical gap that hinders effective support for blind and low-vision users. To bridge this gap, the authors propose the first multi-dimensional AD quality assessment framework grounded in professional accessibility guidelines and expert input. Leveraging Item Response Theory (IRT), they quantitatively compare how well both vision-language models (VLMs) and human raters align with expert judgments when evaluating full-length ADs. Results reveal that while VLM-generated scores exhibit strong agreement with expert assessments, their underlying reasoning processes are less reliable than those of human evaluators. The findings advocate for a hybrid evaluation system that combines VLM scalability with human judgment to enable scalable yet high-quality AD quality control.

📝 Abstract
Digital video is central to communication, education, and entertainment, but without audio description (AD), blind and low-vision audiences are excluded. While crowdsourced platforms and vision-language models (VLMs) expand AD production, quality is rarely checked systematically. Existing evaluations rely on NLP metrics and short-clip guidelines, leaving open questions about what constitutes quality for full-length content and how to assess it at scale. To address these questions, we first developed a multi-dimensional assessment framework for uninterrupted, full-length video, grounded in professional guidelines and refined by accessibility specialists. Second, we integrated this framework into a comprehensive methodological workflow, utilizing Item Response Theory, to assess the proficiency of VLM and human raters against expert-established ground truth. Findings suggest that while VLMs can approximate ground-truth ratings with high alignment, their reasoning was less reliable and actionable than that of human respondents. These insights show the potential of hybrid evaluation systems that leverage VLMs alongside human oversight, offering a path toward scalable AD quality control.
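The workflow above uses Item Response Theory to place raters and items on a common scale. As a rough illustration of the idea (not the paper's actual model or data), the sketch below fits a simple Rasch model by gradient ascent: the probability that rater r matches the expert judgment on item i is modeled as sigmoid(theta[r] - b[i]), where theta is rater proficiency and b is item difficulty. All names and the toy data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, iters=500, lr=0.05):
    """Fit a Rasch model: P(rater r agrees with the expert on item i)
    = sigmoid(theta[r] - b[i]).  `responses` maps (rater, item) -> 0/1.
    Returns (theta, b): rater proficiencies and item difficulties."""
    raters = sorted({r for r, _ in responses})
    items = sorted({i for _, i in responses})
    theta = {r: 0.0 for r in raters}   # rater "proficiency"
    b = {i: 0.0 for i in items}        # item difficulty
    for _ in range(iters):
        g_t = {r: 0.0 for r in raters}
        g_b = {i: 0.0 for i in items}
        for (r, i), y in responses.items():
            p = sigmoid(theta[r] - b[i])
            g_t[r] += y - p            # log-likelihood gradient w.r.t. theta
            g_b[i] -= y - p            # and w.r.t. difficulty
        for r in raters:
            theta[r] += lr * g_t[r]
        for i in items:
            b[i] += lr * g_b[i]
        # Anchor the scale so item difficulties average to zero.
        mean_b = sum(b.values()) / len(b)
        for i in items:
            b[i] -= mean_b
    return theta, b

# Toy data: a "vlm" rater agrees with the expert on more items than
# two hypothetical human raters do.
data = {}
for i in range(8):
    data[("vlm", i)] = 1 if i < 7 else 0
    data[("human1", i)] = 1 if i < 5 else 0
    data[("human2", i)] = 1 if i < 3 else 0

theta, b = fit_rasch(data)
```

Under this model, a rater's estimated theta summarizes agreement with expert ground truth while accounting for which items were easy or hard, which is what lets IRT compare VLM and human raters on the same scale.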
Problem

Research questions and friction points this paper is trying to address.

audio description
quality assessment
vision-language models
accessibility
full-length video
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio description
vision-language models
multi-dimensional assessment
Item Response Theory
accessibility evaluation
Lana Do
Northeastern University
Gio Jung
San Francisco State University
Juvenal Francisco Barajas
San Francisco State University
Andrew Taylor Scott
USA
Shasta Ihorn
USA
Alexander Mario Blum
Stanford University
Vassilis Athitsos
Professor, Computer Science and Engineering Department, University of Texas at Arlington
Computer Vision, Machine Learning, Data Mining, Gesture Recognition, Sign Language Recognition
Ilmi Yoon
Northeastern University