🤖 AI Summary
This work addresses the challenge of safe and independent navigation for people with low vision in urban environments by proposing a visual question answering (VQA)-based event mapping framework, which introduces multimodal large language models (MLLMs) into accessible navigation for the first time. The approach employs a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific fine-tuning and constructs risk maps categorized into four safety levels to support path planning. The evaluation uses a geographically diverse street view dataset spanning 20 cities across six continents and compares four models: ViLT, LLaVA, InstructBLIP, and Qwen-VL. Results show that generative MLLMs substantially outperform classification-based methods, with Qwen-VL achieving the best trade-off between precision and recall, supporting the framework’s feasibility and its ability to generalize across environments.
📝 Abstract
Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
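To make the aggregation step concrete, the following is a minimal sketch of how yes/no VQA answers might be combined into a weighted risk score and binned into four safety categories. The abstract does not give the actual questions, weights, thresholds, or category names, so every identifier and number below (`HAZARD_WEIGHTS`, `CATEGORY_BOUNDS`, the question keys, the labels) is a hypothetical placeholder rather than the authors' method.

```python
from typing import Dict, List

# Hypothetical per-question weights; the real values are not stated in the abstract.
HAZARD_WEIGHTS: Dict[str, float] = {
    "obstacle_on_sidewalk": 3.0,
    "missing_crosswalk": 2.5,
    "construction_present": 2.0,
    "poor_surface": 1.5,
}

# Hypothetical upper bounds on the normalized score, giving four categories.
CATEGORY_BOUNDS = [
    (0.25, "safe"),
    (0.50, "caution"),
    (0.75, "risky"),
    (float("inf"), "dangerous"),
]


def risk_score(answers: Dict[str, bool]) -> float:
    """Weighted fraction of answered hazard questions that came back 'yes'."""
    known = {q: yes for q, yes in answers.items() if q in HAZARD_WEIGHTS}
    total = sum(HAZARD_WEIGHTS[q] for q in known)
    if total == 0:
        return 0.0  # no recognized questions answered
    hits = sum(HAZARD_WEIGHTS[q] for q, yes in known.items() if yes)
    return hits / total


def safety_category(score: float) -> str:
    """Map a score in [0, 1] to the first category whose bound exceeds it."""
    for upper, label in CATEGORY_BOUNDS:
        if score < upper:
            return label
    return CATEGORY_BOUNDS[-1][1]


def segment_category(image_answers: List[Dict[str, bool]]) -> str:
    """Average per-image scores over a street segment, then categorize."""
    if not image_answers:
        return safety_category(0.0)
    mean = sum(risk_score(a) for a in image_answers) / len(image_answers)
    return safety_category(mean)
```

Averaging per-image scores is only one possible aggregation; a more conservative design could take the maximum score along a segment so that a single severe hazard is not diluted by clear images.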