"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time video-language models (VideoLLMs) show strong real-time visual understanding, but their awareness of dynamic environments when assisting visually impaired users has not been systematically evaluated. Method: The authors construct VisAssistDaily, the first benchmark of assistive tasks spanning Basic Skills, Home Life Tasks, and Social Life Tasks, and evaluate both closed-source (GPT-4o) and open-source VideoLLMs through multi-turn vision-language interaction and a user study in closed-world and open-world settings. To address the hazard-perception gap revealed by the study, they build SafeVid, an environment-risk perception dataset, and introduce a polling mechanism that lets the model proactively query the scene for emerging risks. Results: GPT-4o achieves the highest task success rate on VisAssistDaily, and SafeVid combined with the polling mechanism markedly improves risk detection, addressing a key weakness of current models in recognizing evolving hazards. This is the first systematic evaluation of VideoLLMs for real-time visual assistance to blind and low-vision users, establishing a benchmark and a technical pathway toward trustworthy assistance.

📝 Abstract
The visually impaired population, especially those with severe visual impairment, is large, and daily activities pose significant challenges for them. Although many studies use large language models and vision-language models to assist blind users, most focus on static content and fail to meet the real-time perception needs of dynamic and complex environments, such as those encountered in daily activities. Providing more effective intelligent assistance therefore requires incorporating advanced visual understanding technologies. Although VideoLLMs with real-time vision and speech interaction demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset, VisAssistDaily, covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.
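
The abstract describes the polling mechanism only at a high level. As a rough illustration (not the paper's implementation), the sketch below shows one way such a loop could work, assuming GPT-4o's image-input chat API via the OpenAI Python SDK and a camera frame read with OpenCV; the polling interval, prompt wording, and the "SAFE" reply convention are all assumptions made for this example.

```python
# Minimal sketch of a hazard-polling loop (illustrative only; not the paper's code).
# Assumes the OpenAI Python SDK with GPT-4o image input and a local webcam via OpenCV.
import base64
import time

import cv2
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

POLL_INTERVAL_S = 2.0  # assumed polling period; the paper's actual interval may differ
HAZARD_PROMPT = (
    "You are assisting a blind user. Look at this frame from their camera. "
    "If there is an imminent hazard (obstacle, vehicle, stairs, hot object), "
    "reply with one short warning sentence; otherwise reply exactly 'SAFE'."
)


def frame_to_data_url(frame) -> str:
    """Encode an OpenCV BGR frame as a base64 JPEG data URL."""
    ok, jpeg = cv2.imencode(".jpg", frame)
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return "data:image/jpeg;base64," + base64.b64encode(jpeg.tobytes()).decode()


def poll_for_hazards(camera_index: int = 0) -> None:
    """Periodically send the latest frame to the model and surface any warning."""
    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{
                    "role": "user",
                    "content": [
                        {"type": "text", "text": HAZARD_PROMPT},
                        {"type": "image_url",
                         "image_url": {"url": frame_to_data_url(frame)}},
                    ],
                }],
            )
            answer = resp.choices[0].message.content.strip()
            if answer != "SAFE":
                print(f"[ALERT] {answer}")  # a real assistant would speak this aloud
            time.sleep(POLL_INTERVAL_S)
    finally:
        cap.release()


if __name__ == "__main__":
    poll_for_hazards()
```

The polling period is the central trade-off in such a design: shorter intervals catch hazards sooner but increase latency pressure and API cost, which is part of what the paper's user study probes in practice.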
Problem

Research questions and friction points this paper is trying to address.

Evaluating VideoLLMs for real-time assistance to visually impaired individuals
Addressing challenges in dynamic environment perception for daily tasks
Improving hazard detection in real-time visual understanding for safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time VideoLLMs for dynamic visual understanding
Benchmark dataset VisAssistDaily for assistive tasks (per-category success-rate scoring sketched below)
Polling mechanism for proactive hazard detection
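
The page reports task success rate as the headline metric but does not show how it is computed. The sketch below illustrates straightforward per-category scoring for a VisAssistDaily-style benchmark; the record fields and category names are assumptions based on the three task categories named in the abstract, not the released dataset format.

```python
# Illustrative per-category task-success-rate scoring (assumed record layout).
from collections import defaultdict

# One record per evaluated task: its category and whether the model completed it,
# as judged by the evaluation protocol.
results = [
    {"category": "Basic Skills",      "success": True},
    {"category": "Home Life Tasks",   "success": False},
    {"category": "Social Life Tasks", "success": True},
    # ... one record per task in the benchmark
]


def success_rates(records):
    """Return {category: success rate} plus an overall rate across all tasks."""
    totals, wins = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["category"]] += 1
        wins[r["category"]] += int(r["success"])
    rates = {c: wins[c] / totals[c] for c in totals}
    rates["Overall"] = sum(wins.values()) / sum(totals.values())
    return rates


print(success_rates(results))
```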