From Speech-to-Spatial: Grounding Utterances on A Live Shared View with Augmented Reality

๐Ÿ“… 2026-02-03
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the inefficiency and high cognitive load inherent in remote voice-guided tasks, which often stem from the lack of spatial deixis and necessitate repeated verbal refinements. To overcome this limitation, the authors propose the Speech-to-Spatial frameworkโ€”the first approach to resolve referential ambiguity using only speech input, without relying on gestures, eye tracking, or manual annotations. By analyzing four common patterns of spoken spatial references, the method constructs an object-centric relational graph to anchor utterances in 3D space and generates persistent, in-situ augmented reality (AR) visual cues within a shared live view, accurately mapping spoken instructions to physical targets. Experimental results demonstrate that, compared to a voice-only baseline, the proposed system significantly improves task efficiency, reduces cognitive load, and enhances both usability and interpretability in remote guidance scenarios.
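The object-centric relational graph described above can be pictured as a small lookup structure: nodes hold detected objects with their attributes, edges hold pairwise spatial relations, and an utterance is resolved by intersecting the candidates each cue selects. The following is a minimal, hypothetical sketch of that idea (not the authors' implementation; all class, relation, and object names are illustrative assumptions), covering the "Direct Attribute" and "Relational" reference patterns:

```python
# Hypothetical sketch of an object-centric relational graph for referent
# disambiguation. Names ("left_of", "mug_1", etc.) are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    attributes: set = field(default_factory=set)   # e.g. {"red", "small"}
    relations: dict = field(default_factory=dict)  # relation -> set of anchor names

class RelationalGraph:
    def __init__(self):
        self.objects = {}

    def add(self, obj: SceneObject):
        self.objects[obj.name] = obj

    def relate(self, subject: str, relation: str, anchor: str):
        # Record a directed spatial relation, e.g. mug_1 --left_of--> laptop.
        self.objects[subject].relations.setdefault(relation, set()).add(anchor)

    def by_attribute(self, attr: str):
        # "Direct Attribute" pattern: "the red mug"
        return {o.name for o in self.objects.values() if attr in o.attributes}

    def by_relation(self, relation: str, anchor: str):
        # "Relational" pattern: "the mug left of the laptop"
        return {o.name for o in self.objects.values()
                if anchor in o.relations.get(relation, set())}

# Resolving "the red mug left of the laptop": intersect both cue sets.
g = RelationalGraph()
g.add(SceneObject("mug_1", {"red"}))
g.add(SceneObject("mug_2", {"blue"}))
g.add(SceneObject("laptop"))
g.relate("mug_1", "left_of", "laptop")

candidates = g.by_attribute("red") & g.by_relation("left_of", "laptop")
print(candidates)  # {'mug_1'}
```

In this toy view, a resolved candidate set with a single element is what would be handed off to the AR renderer as the anchor for a persistent in-situ cue; "Remembrance" and "Chained" patterns would layer temporal history and multi-hop traversal on the same graph.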

๐Ÿ“ Abstract
We introduce Speech-to-Spatial, a referent disambiguation framework that converts verbal remote-assistance instructions into spatially grounded AR guidance. Unlike prior systems that rely on additional cues (e.g., gesture, gaze) or manual expert annotations, Speech-to-Spatial infers the intended target solely from spoken references (speech input). Motivated by our formative study of speech referencing patterns, we characterize recurring ways people specify targets (Direct Attribute, Relational, Remembrance, and Chained) and ground them to our object-centric relational graph. Given an utterance, referent cues are parsed and rendered as persistent in-situ AR visual guidance, reducing iterative micro-guidance ("a bit more to the right", "now, stop.") during remote guidance. We demonstrate the use cases of our system with remote guided assistance and intent disambiguation scenarios. Our evaluation shows that Speech-to-Spatial improves task efficiency, reduces cognitive load, and enhances usability compared to a conventional voice-only baseline, transforming disembodied verbal instruction into visually explainable, actionable guidance on a live shared view.
Problem

Research questions and friction points this paper is trying to address.

referent disambiguation
speech grounding
augmented reality
remote assistance
spatial referencing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech-to-Spatial
referent disambiguation
augmented reality
spatial grounding
voice-only guidance
๐Ÿ”Ž Similar Papers
No similar papers found.