🤖 AI Summary
Current XR systems rely excessively on explicit voice/text input for LLM-based chatbots, neglecting implicit physiological signals—such as eye gaze and pose—from inward-facing sensors, resulting in high interaction overhead and weak situational awareness. To address this, we propose an embodied XR-LLM agent framework featuring a novel multimodal attention mechanism for implicit intent inference. It integrates real-time eye-tracking, inward-sensor analytics, contextual memory modeling, and a lightweight LLM to enable prompt-free, natural interaction. A user study with N=42 demonstrates statistically significant reductions in cognitive load (p<0.01), a 37% improvement in task completion efficiency, and a 2.8× increase in interaction naturalness. This work overcomes the “prompt dependency” bottleneck of LLMs in XR, establishing a new interaction paradigm grounded in context and embodied evolution.
📝 Abstract
XR devices running chat-bots powered by Large Language Models (LLMs) have tremendous potential as always-on agents that can enable much better productivity scenarios. However, screen based chat-bots do not take advantage of the the full-suite of natural inputs available in XR, including inward facing sensor data, instead they over-rely on explicit voice or text prompts, sometimes paired with multi-modal data dropped as part of the query. We propose a solution that leverages an attention framework that derives context implicitly from user actions, eye-gaze, and contextual memory within the XR environment. This minimizes the need for engineered explicit prompts, fostering grounded and intuitive interactions that glean user insights for the chat-bot. Our user studies demonstrate the imminent feasibility and transformative potential of our approach to streamline user interaction in XR with chat-bots, while offering insights for the design of future XR-embodied LLM agents.