AI Summary
This work addresses the inefficiency and high cognitive load inherent in remote voice-guided tasks, which often stem from the lack of spatial deixis and necessitate repeated verbal refinements. To overcome this limitation, the authors propose the Speech-to-Spatial framework, the first approach to resolve referential ambiguity using only speech input, without relying on gestures, eye tracking, or manual annotations. By analyzing four common patterns of spoken spatial references, the method constructs an object-centric relational graph to anchor utterances in 3D space and generates persistent, in-situ augmented reality (AR) visual cues within a shared live view, accurately mapping spoken instructions to physical targets. Experimental results demonstrate that, compared to a voice-only baseline, the proposed system significantly improves task efficiency, reduces cognitive load, and enhances both usability and interpretability in remote guidance scenarios.
Abstract
We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them in our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop") during remote guidance. We demonstrate use cases of our system in remote guided assistance and intent-disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
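To make the idea of grounding spoken references in an object-centric relational graph concrete, the sketch below shows one plausible way such a resolver could be structured. This is a minimal illustration, not the paper's implementation: the class names, the parsed-utterance format, the relation labels, and the handling of only the Direct Attribute and Relational patterns are all assumptions made for exposition.

```python
# Hypothetical sketch of an object-centric relational graph and a
# pattern-based referent resolver. Not the authors' code; names and
# data formats are illustrative only.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    obj_id: str
    label: str                                      # e.g. "mug"
    attributes: set = field(default_factory=set)    # e.g. {"red"}
    position: tuple = (0.0, 0.0, 0.0)               # 3D anchor for the AR cue

@dataclass
class RelationalGraph:
    objects: dict = field(default_factory=dict)     # obj_id -> SceneObject
    relations: dict = field(default_factory=dict)   # (obj_id, relation) -> obj_id

    def add_object(self, obj: SceneObject):
        self.objects[obj.obj_id] = obj

    def add_relation(self, src: str, relation: str, dst: str):
        # e.g. ("mug_1", "left_of", "tool_1")
        self.relations[(src, relation)] = dst

    def by_attribute(self, label: str, attrs: set):
        # Direct Attribute pattern: "the red mug"
        return [o for o in self.objects.values()
                if o.label == label and attrs <= o.attributes]

    def by_relation(self, anchor_id: str, relation: str):
        # Relational pattern: "the one to the left of the red mug"
        dst = self.relations.get((anchor_id, relation))
        return self.objects.get(dst)

def resolve(graph: RelationalGraph, parsed: dict):
    """Map a parsed utterance (hypothetical format) to a scene object."""
    if parsed["pattern"] == "direct_attribute":
        candidates = graph.by_attribute(parsed["label"], set(parsed["attrs"]))
        return candidates[0] if candidates else None
    if parsed["pattern"] == "relational":
        anchor = resolve(graph, parsed["anchor"])
        return graph.by_relation(anchor.obj_id, parsed["relation"]) if anchor else None
    return None  # Remembrance and Chained patterns would extend this dispatch

# Usage: "the screwdriver to the left of the red mug"
g = RelationalGraph()
g.add_object(SceneObject("mug_1", "mug", {"red"}, (0.4, 0.1, 0.8)))
g.add_object(SceneObject("tool_1", "screwdriver", set(), (0.1, 0.1, 0.8)))
g.add_relation("mug_1", "left_of", "tool_1")
target = resolve(g, {"pattern": "relational", "relation": "left_of",
                     "anchor": {"pattern": "direct_attribute",
                                "label": "mug", "attrs": ["red"]}})
print(target)  # resolved object; its 3D position would anchor the in-situ AR cue
```

In this reading, each reference pattern maps to a different query over the same graph, and the resolved object's 3D position is what the system would use to place the persistent AR cue in the shared live view.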