GesPrompt: Leveraging Co-Speech Gestures to Augment LLM-Based Interaction in Virtual Reality

📅 2025-05-08

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

In XR environments, pure text or speech inadequately conveys 3D spatiotemporal information, resulting in low interaction efficiency and high cognitive load between users and LLM-based collaborators. To address this, we propose GesPrompt—the first multimodal XR interface integrating co-speech gestures with speech. It introduces a novel gesture–speech joint prompting paradigm that automatically extracts spatial and temporal references. We design an end-to-end gesture-augmented LLM interaction workflow and implement a functional VR prototype in a Unity–Python architecture, integrating Meta Quest hand tracking, ASR, spatiotemporal alignment modeling, and LLM prompt engineering. A user study demonstrates that GesPrompt improves instruction accuracy by 37%, reduces average task completion time by 42%, and decreases NASA-TLX cognitive load by 31% compared to a speech-only baseline.

Technology Category

Application Category

📝 Abstract

Large Language Model (LLM)-based copilots have shown great potential in Extended Reality (XR) applications. However, the user faces challenges when describing the 3D environments to the copilots due to the complexity of conveying spatial-temporal information through text or speech alone. To address this, we introduce GesPrompt, a multimodal XR interface that combines co-speech gestures with speech, allowing end-users to communicate more naturally and accurately with LLM-based copilots in XR environments. By incorporating gestures, GesPrompt extracts spatial-temporal reference from co-speech gestures, reducing the need for precise textual prompts and minimizing cognitive load for end-users. Our contributions include (1) a workflow to integrate gesture and speech input in the XR environment, (2) a prototype VR system that implements the workflow, and (3) a user study demonstrating its effectiveness in improving user communication in VR environments.

Problem

Research questions and friction points this paper is trying to address.

Enhancing VR communication with LLMs using gestures

Reducing cognitive load in 3D environment descriptions

Integrating multimodal input for spatial-temporal accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal XR interface combining gestures and speech

Extracts spatial-temporal references from co-speech gestures

Reduces need for precise textual prompts in VR

🔎 Similar Papers

EmBARDiment: an Embodied AI Agent for Productivity in XR