🤖 AI Summary
This work addresses the limitations of current audio navigation systems, which often lack environmental context and directional awareness, leaving users disoriented. The authors propose a novel approach that integrates a vision-language model (VLM) with real-time spatial audio to deliver context-aware navigation instructions grounded in recognized environmental landmarks. When users deviate from their intended heading, the system provides immediate corrective feedback through directional spatial audio cues. This is the first system to combine VLM-based environmental understanding with real-time corrective spatial audio, significantly enhancing users’ sense of direction. A user study (n=12) demonstrates that the proposed method substantially reduces path deviations compared to both a VLM-only system and a Google Maps audio navigation baseline, with participants consistently rating the directional cues as effective and the overall experience as superior to both baselines.
📝 Abstract
Audio-only walking navigation can leave users disoriented: it often relies on vague cardinal directions and lacks real-time environmental context, leading to frequent errors. To address this, we present a novel system that integrates a Vision Language Model (VLM) with a spatial audio cue. Our system extracts environmental landmarks to anchor navigation instructions and, crucially, plays a directional spatial audio signal when the user faces the wrong direction, indicating the precise turn direction. In a user study (n=12), the combined VLM and spatial audio system reduced route deviations compared to both a VLM-only system and an audio-only Google Maps baseline. Users reported that the spatial audio cue effectively supported orientation and that landmark-anchored instructions provided a better navigation experience than audio-only Google Maps. This work offers an initial look at how future audio-only navigation systems can incorporate directional cues, especially real-time corrective spatial audio.
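The abstract does not specify implementation details, but the corrective-cue idea can be sketched: compute the signed difference between the user's current heading and the route's target bearing, and when it exceeds a threshold, render an audio cue panned toward the correct turn direction. Below is a minimal Python illustration of that logic; the threshold value, the panning law, and all function names are assumptions, and equal-power stereo panning stands in for the true spatial (binaural) audio a real system would use.

```python
import math

# Hypothetical parameter -- the paper does not state when the cue triggers.
DEVIATION_THRESHOLD_DEG = 30.0


def signed_heading_error(user_heading_deg: float, target_bearing_deg: float) -> float:
    """Smallest signed angle (degrees) from the user's heading to the target bearing.

    Positive means the target lies to the user's right; negative, to the left.
    """
    return (target_bearing_deg - user_heading_deg + 180.0) % 360.0 - 180.0


def corrective_cue(user_heading_deg: float, target_bearing_deg: float):
    """Return (left_gain, right_gain) for a corrective cue, or None if on course.

    Equal-power stereo panning is a simple stand-in for HRTF-based spatial audio.
    """
    error = signed_heading_error(user_heading_deg, target_bearing_deg)
    if abs(error) < DEVIATION_THRESHOLD_DEG:
        return None  # facing roughly the right way: stay silent, don't distract

    # Map the heading error onto a pan position in [-1, 1] (full left .. full right).
    pan = max(-1.0, min(1.0, error / 90.0))
    angle = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2 for the equal-power law
    return math.cos(angle), math.sin(angle)


if __name__ == "__main__":
    # User faces north (0 deg) but the route continues east (90 deg):
    # the cue should sound entirely from the right -> (0.0, 1.0).
    print(corrective_cue(user_heading_deg=0.0, target_bearing_deg=90.0))
```

In a full system, the returned gains would drive playback of the cue sound, and the heading error would be refreshed continuously from the device compass so the cue appears to move as the user rotates toward the correct direction.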