AI Summary
This study addresses the limited capability of existing real-time video-assisted AI systems in supporting blind and visually impaired (BVI) users within dynamic environments. We conduct the first systematic evaluation of ChatGPT Advanced Voice with Video in authentic, ecologically valid tasks, including real-world object localization and landmark identification, using multimodal user behavior data and qualitative feedback. Results reveal that while the system performs adequately on static scene description, it exhibits critical deficiencies in dynamic contexts: spatial perception inaccuracies, hallucinated outputs, response latency, and excessive user accommodation, all of which severely compromise reliability and safety. Our analysis identifies three core bottlenecks in real-time multimodal assistance: insufficient perceptual robustness, misaligned intervention timing, and inadequate ecological integration and privacy safeguards. Based on these findings, we propose a novel AI agent design framework for BVI assistance, centered on embodied perception enhancement, context-aware intervention, and trustworthy human-AI collaboration, providing empirical grounding and methodological guidance for next-generation accessible multimodal intelligent agents.
Abstract
Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT's Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users' visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns.