AI Summary
This work addresses the inefficiency and high cognitive load inherent in remote voice-guided tasks, which often stem from the lack of spatial deixis and necessitate repeated verbal refinements. To overcome this limitation, the authors propose the Speech-to-Spatial framework, the first approach to resolve referential ambiguity using only speech input, without relying on gestures, eye tracking, or manual annotations. By analyzing four common patterns of spoken spatial references, the method constructs an object-centric relational graph to anchor utterances in 3D space and generates persistent, in-situ augmented reality (AR) visual cues within a shared live view, accurately mapping spoken instructions to physical targets. Experimental results demonstrate that, compared to a voice-only baseline, the proposed system significantly improves task efficiency, reduces cognitive load, and enhances both usability and interpretability in remote guidance scenarios.
Abstract
We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them in our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop") during remote guidance. We demonstrate use cases of our system in remote guided assistance and intent-disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
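To make the idea of grounding spoken references in an object-centric relational graph concrete, the sketch below shows one plausible way such a resolver could be structured. This is a minimal illustration, not the paper's implementation: the class names, the parsed-utterance format, the relation labels, and the handling of only the Direct Attribute and Relational patterns are all assumptions made for exposition.

```python
# Hypothetical sketch of an object-centric relational graph and a
# pattern-based referent resolver. Not the authors' code; names and
# data formats are illustrative only.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    obj_id: str
    label: str                                      # e.g. "mug"
    attributes: set = field(default_factory=set)    # e.g. {"red"}
    position: tuple = (0.0, 0.0, 0.0)               # 3D anchor for the AR cue

@dataclass
class RelationalGraph:
    objects: dict = field(default_factory=dict)     # obj_id -> SceneObject
    relations: dict = field(default_factory=dict)   # (obj_id, relation) -> obj_id

    def add_object(self, obj: SceneObject):
        self.objects[obj.obj_id] = obj

    def add_relation(self, src: str, relation: str, dst: str):
        # e.g. ("mug_1", "left_of", "tool_1")
        self.relations[(src, relation)] = dst

    def by_attribute(self, label: str, attrs: set):
        # Direct Attribute pattern: "the red mug"
        return [o for o in self.objects.values()
                if o.label == label and attrs <= o.attributes]

    def by_relation(self, anchor_id: str, relation: str):
        # Relational pattern: "the one to the left of the red mug"
        dst = self.relations.get((anchor_id, relation))
        return self.objects.get(dst)

def resolve(graph: RelationalGraph, parsed: dict):
    """Map a parsed utterance (hypothetical format) to a scene object."""
    if parsed["pattern"] == "direct_attribute":
        candidates = graph.by_attribute(parsed["label"], set(parsed["attrs"]))
        return candidates[0] if candidates else None
    if parsed["pattern"] == "relational":
        anchor = resolve(graph, parsed["anchor"])
        return graph.by_relation(anchor.obj_id, parsed["relation"]) if anchor else None
    return None  # Remembrance and Chained patterns would extend this dispatch

# Usage: "the screwdriver to the left of the red mug"
g = RelationalGraph()
g.add_object(SceneObject("mug_1", "mug", {"red"}, (0.4, 0.1, 0.8)))
g.add_object(SceneObject("tool_1", "screwdriver", set(), (0.1, 0.1, 0.8)))
g.add_relation("mug_1", "left_of", "tool_1")
target = resolve(g, {"pattern": "relational", "relation": "left_of",
                     "anchor": {"pattern": "direct_attribute",
                                "label": "mug", "attrs": ["red"]}})
print(target)  # resolved object; its 3D position would anchor the in-situ AR cue
```

In this reading, each reference pattern maps to a different query over the same graph, and the resolved object's 3D position is what the system would use to place the persistent AR cue in the shared live view.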