🤖 AI Summary
People who are blind or have low vision (BLV) often hesitate to travel independently in unfamiliar environments because they lack visual context about the physical landscape. Existing pre-travel tools provide only static landmarks and turn-by-turn instructions, which are insufficient for building a robust spatial understanding. This work introduces SceneScout, a multimodal large language model (MLLM)-driven AI agent that makes street view imagery accessible to BLV users. SceneScout supports two interaction modes: Route Preview, which lets users familiarize themselves with visual details along a route, and Virtual Exploration, which enables free movement within street view imagery, turning static images into interactive, speech-based environmental descriptions. A user study (N=10) shows that SceneScout surfaces visual information unavailable through existing means. A technical evaluation finds that most descriptions are accurate (72%) and cover stable visual elements (95%) even in older imagery, though occasional subtle, plausible errors make them difficult to verify without sight.
📝 Abstract
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation, those exploring pre-travel assistance typically provide only landmarks and turn-by-turn instructions, lacking detailed visual context. Street view imagery, which contains rich visual information and has the potential to reveal numerous environmental details, remains inaccessible to BLV people. In this work, we introduce SceneScout, a multimodal large language model (MLLM)-driven AI agent that enables accessible interactions with street view imagery. SceneScout supports two modes: (1) Route Preview, enabling users to familiarize themselves with visual details along a route, and (2) Virtual Exploration, enabling free movement within street view imagery. Our user study (N=10) demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. A technical evaluation shows that most descriptions are accurate (72%) and describe stable visual elements (95%) even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of using street view imagery to enhance navigation experiences.