AI Summary
This study addresses the limitations of existing AI-generated audio descriptions for videos, which often fail to meet the personalized needs of blind and low-vision users and lack evaluation in real-world, long-term interactive settings. To bridge this gap, the authors present a web-based platform that integrates a multimodal large language model to deliver six customizable types of audio descriptions and conversational video question answering for YouTube videos. This work represents the first implementation of long-term, customizable AI-powered audio descriptions and interactive access in authentic user environments, validated through a longitudinal field study. Findings demonstrate that personalization significantly enhances users' perceived effectiveness, enjoyment, and immersion, underscoring the critical role of tailored interaction in improving video accessibility for people with visual impairments.
Abstract
Advances in multimodal large language models enable automatic video narration and video question answering (VQA), offering scalable alternatives to labor-intensive, human-authored audio descriptions (ADs) for blind and low vision (BLV) viewers. However, prior AI-driven AD systems rarely adapt to the diverse needs and preferences of BLV individuals across videos and are typically evaluated in controlled, single-session settings. We present ViDscribe, a web-based platform that integrates AI-generated ADs with six types of user customizations and a conversational VQA interface for YouTube videos. Through a longitudinal, in-the-wild study with eight BLV participants, we examine how users engage with the customization and VQA features over time. Our results show sustained engagement with both features and demonstrate that customized ADs improve effectiveness, enjoyment, and immersion over default ADs, highlighting the value of personalized, interactive video access for BLV users.