🤖 AI Summary
Visually impaired individuals face a fundamental trade-off between precise spatial guidance and open-world object recognition in unstructured environments: existing solutions either rely on pre-scanned scenes or restricted object categories, or lack fine-grained spatial feedback. This paper introduces NaviSense, an end-to-end mobile system that jointly enables open-vocabulary object recognition and real-time spatial navigation. It leverages large vision-language models to interpret natural-language target descriptions, integrates LiDAR-based depth sensing with AR-enabled spatial mapping, and delivers multimodal audio-haptic feedback for precise guidance, all without prior environment setup or object-category constraints. Its key innovation lies in unifying open-world perception with high-precision spatial reasoning within a lightweight mobile architecture. In a user study with 12 blind and low-vision participants, the system significantly reduced object retrieval time and was preferred over state-of-the-art baselines, demonstrating both efficacy and practical viability.
📝 Abstract
People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce 'NaviSense', a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.
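To make the sensing loop concrete, here is a minimal sketch (not the authors' implementation) of one way the pieces could fit together on iOS: a 2D target detection from a vision-language model is fused with ARKit raycasting over LiDAR scene depth to drive distance-scaled haptic pulses. The `GuidanceSketch` class, the normalized `detectionCenter` point assumed to come from a VLM, and the 1.5 m intensity cutoff are all illustrative assumptions, not details from the paper.

```swift
import ARKit
import UIKit
import simd

/// Illustrative sketch only: fuses a hypothetical VLM detection (a point in
/// normalized image coordinates) with ARKit raycasting over LiDAR scene
/// depth, producing haptic pulses that strengthen as the phone nears the
/// target object.
final class GuidanceSketch {
    let session = ARSession()
    let haptics = UIImpactFeedbackGenerator(style: .medium)

    func start() {
        let config = ARWorldTrackingConfiguration()
        // LiDAR-equipped devices expose per-pixel depth via scene depth.
        if ARWorldTrackingConfiguration.supportsFrameSemantics(.sceneDepth) {
            config.frameSemantics.insert(.sceneDepth)
        }
        session.run(config)
    }

    /// `detectionCenter` is the target's bounding-box center in normalized
    /// image coordinates ((0,0) top-left, (1,1) bottom-right), assumed to
    /// come from an open-vocabulary vision-language model query.
    func guide(toward detectionCenter: CGPoint, in frame: ARFrame) {
        // Cast a ray through the detected pixel into the reconstructed scene.
        let query = frame.raycastQuery(from: detectionCenter,
                                       allowing: .estimatedPlane,
                                       alignment: .any)
        guard let hit = session.raycast(query).first else { return }

        // Distance from the camera to the target's 3D position.
        let cam = simd_make_float3(frame.camera.transform.columns.3)
        let target = simd_make_float3(hit.worldTransform.columns.3)
        let distance = simd_distance(cam, target)

        // Map distance to pulse intensity; the 1.5 m cutoff is an
        // illustrative assumption, not a value from the paper.
        let intensity = max(0, (1.5 - distance) / 1.5)
        if intensity > 0 {
            haptics.impactOccurred(intensity: CGFloat(min(intensity, 1)))
        }
    }
}
```

The full system described in the abstract also layers conversational clarification and continuous audio cues on top of this loop; the sketch covers only the depth-to-haptics step.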