Urban Risk-Aware Navigation via VQA-Based Event Maps for People with Low Vision

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses safe and independent urban navigation for people with low vision by proposing an event mapping framework based on visual question answering (VQA) that applies multimodal large language models (MLLMs) to accessible navigation. The approach uses a three-level hierarchical query structure to obtain fine-grained scene understanding without task-specific retraining, and aggregates model responses into risk maps with four discrete safety categories to support route planning. The framework is evaluated on a geographically diverse dataset spanning 20 cities across six continents (over 800 annotated images and 18,000 answered questions), comparing ViLT, LLaVA, InstructBLIP, and Qwen-VL. Generative MLLMs substantially outperform the classification-based approach, with Qwen-VL achieving the best overall balance of precision and recall, which supports the framework's feasibility and its ability to generalize across environments.
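To make the precision and recall comparison concrete, the following minimal sketch computes both metrics for binary hazard answers. It assumes yes/no answers encoded as booleans; the function name `precision_recall_f1` and the use of F1 as the "balance" measure are illustrative assumptions, not details taken from the paper.

```python
def precision_recall_f1(predicted: list[bool], actual: list[bool]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for yes/no hazard answers.

    `predicted` holds the model's answers and `actual` the annotations;
    True means "hazard present". Both names are hypothetical.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))

    # Guard against division by zero when a class never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative data only: one correct detection, one false alarm, one miss.
model_answers = [True, True, False, False]
annotations = [True, False, True, False]
metrics = precision_recall_f1(model_answers, annotations)
```

In an assistive setting, low recall means missed hazards and low precision means needless warnings, which is why a model that balances the two is preferred over one that maximizes either alone.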

📝 Abstract

Visual impairment affects hundreds of millions of people worldwide, severely limiting their ability to navigate urban environments safely and independently. While wearable assistive devices offer a promising platform for real-time hazard detection, existing approaches rely on task-specific vision pipelines that lack flexibility and generalizability. In this work, we propose an event map framework based on visual question answering that leverages Vision-Language Models (VLMs) for pedestrian scene description and hazard identification across diverse real-world environments, using a three-level hierarchical query structure to enable fine-grained scene understanding without task-specific retraining. Model responses are aggregated into a weighted risk scoring system that maps street segments into four discrete safety categories, producing navigable risk-aware event maps for route planning. To support evaluation and future research, we introduce a geographically diverse dataset spanning 20 cities across six continents, comprising over 800 annotated images and 18,000 answered questions. We benchmark four VQA architectures (ViLT, LLaVA, InstructBLIP, and Qwen-VL) and find that generative Multimodal Large Language Models (MLLMs) substantially outperform classification-based approaches, with Qwen-VL achieving the best overall balance of precision and recall. These results demonstrate the viability of MLLMs as a flexible and generalizable foundation for assistive navigation systems for visually impaired people.
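The weighted risk scoring step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the questions, their query levels, the weights, the thresholds, and the four category labels are hypothetical placeholders, since the abstract does not give the actual values.

```python
from bisect import bisect_right

# Hypothetical questions mapped to (query level, weight).
# A "yes" answer signals risk; none of these values come from the paper.
QUESTIONS = {
    "Is the sidewalk missing?": (1, 3.0),
    "Is there an obstacle on the walking path?": (2, 2.0),
    "Is the surface uneven or broken?": (2, 1.0),
    "Is the obstacle a moving vehicle?": (3, 4.0),
}

# Assumed cut points splitting the normalized score into four categories.
THRESHOLDS = [0.25, 0.5, 0.75]
CATEGORIES = ["safe", "caution", "risky", "dangerous"]


def risk_score(answers: dict[str, bool]) -> float:
    """Return the weighted share of risk-signalling answers, in [0, 1]."""
    total = sum(weight for _, weight in QUESTIONS.values())
    flagged = sum(
        weight
        for question, (_, weight) in QUESTIONS.items()
        if answers.get(question, False)  # unanswered questions count as "no"
    )
    return flagged / total


def safety_category(score: float) -> str:
    """Map a normalized risk score to one of four discrete categories."""
    return CATEGORIES[bisect_right(THRESHOLDS, score)]


# One street segment where only the sidewalk question was answered "yes".
segment_answers = {"Is the sidewalk missing?": True}
segment_score = risk_score(segment_answers)
segment_category = safety_category(segment_score)
```

With these placeholder weights the segment scores 3.0 / 10.0 = 0.3 and falls in the second category. A route planner could then treat the category of each segment as an edge cost when searching for a path.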

Problem

Research questions and friction points this paper is trying to address.

low vision

urban navigation

hazard detection

risk-aware

assistive technology

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Question Answering

Vision-Language Models

Risk-Aware Navigation