AI Summary
This work addresses the limitations of existing audio description (AD) tools, which rely heavily on visual interfaces and thus fail to meet the needs of blind and low-vision (BLV) video creators. The authors propose ADCanvas, the first end-to-end AD authoring system that integrates a conversational multimodal large language model with an accessible editing environment. Designed for full compatibility with screen readers, keyboard-driven playback controls, and plain-text editing, ADCanvas also incorporates real-time visual question answering (VQA) capabilities. A user study with 12 BLV creators demonstrates that the system effectively serves as both an informational assistant and a draft-generation tool, enabling users to efficiently produce and refine AD scripts while retaining full creative control. These findings validate the system's usability and practical utility in real-world AD creation workflows.
Abstract
Audio description (AD) provides essential access to visual media for blind and low-vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually driven interfaces. We present ADCanvas, a multimodal authoring system that supports non-visual control over AD creation. ADCanvas combines conversational interaction with keyboard-based playback control and a plain-text, screen-reader-accessible editor to support end-to-end AD authoring and visual question answering (VQA). Its multimodal LLM agent supports live VQA, script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human-AI collaboration.