🤖 AI Summary
In XR environments, pure text or speech inadequately conveys 3D spatiotemporal information, resulting in low interaction efficiency and high cognitive load between users and LLM-based collaborators. To address this, we propose GesPrompt—the first multimodal XR interface integrating co-speech gestures with speech. It introduces a novel gesture–speech joint prompting paradigm that automatically extracts spatial and temporal references. We design an end-to-end gesture-augmented LLM interaction workflow and implement a functional VR prototype in a Unity–Python architecture, integrating Meta Quest hand tracking, ASR, spatiotemporal alignment modeling, and LLM prompt engineering. A user study demonstrates that GesPrompt improves instruction accuracy by 37%, reduces average task completion time by 42%, and decreases NASA-TLX cognitive load by 31% compared to a speech-only baseline.
📝 Abstract
Large Language Model (LLM)-based copilots have shown great potential in Extended Reality (XR) applications. However, the user faces challenges when describing the 3D environments to the copilots due to the complexity of conveying spatial-temporal information through text or speech alone. To address this, we introduce GesPrompt, a multimodal XR interface that combines co-speech gestures with speech, allowing end-users to communicate more naturally and accurately with LLM-based copilots in XR environments. By incorporating gestures, GesPrompt extracts spatial-temporal reference from co-speech gestures, reducing the need for precise textual prompts and minimizing cognitive load for end-users. Our contributions include (1) a workflow to integrate gesture and speech input in the XR environment, (2) a prototype VR system that implements the workflow, and (3) a user study demonstrating its effectiveness in improving user communication in VR environments.