🤖 AI Summary
Existing proactive AR agents rely on explicit voice interaction, which can disturb the surroundings and cause social discomfort. This work proposes a seamless interaction framework that leverages egocentric vision and multimodal sensing to infer user context in real time, enabling large multimodal models (LMMs) to dynamically decide *when* and *how* to deliver assistance, thereby achieving low-intrusion, contextually appropriate proactive support. The system is deployed on XR headsets with on-device real-time inference. A user study demonstrates that, compared to voice-triggered baselines, our approach significantly reduces interaction burden (*p* < 0.01), improves usability by 32%, and achieves an 89% user preference rate. Our core contribution is the first integration of context-driven, multimodal temporal decision-making into proactive AR interaction, establishing a natural, implicit, and scalable intelligent assistance paradigm.
📝 Abstract
Proactive AR agents promise context-aware assistance, but their interactions often rely on explicit voice prompts or responses, which can be disruptive or socially awkward. We introduce Sensible Agent, a framework designed for unobtrusive interaction with these proactive agents. Sensible Agent dynamically adapts both "what" assistance to offer and, crucially, "how" to deliver it, based on real-time multimodal context sensing. Informed by an expert workshop (n=12) and a data annotation study (n=40), the framework leverages egocentric cameras, multimodal sensing, and Large Multimodal Models (LMMs) to infer context and suggest appropriate actions delivered via minimally intrusive interaction modes. We demonstrate our prototype on an XR headset through a user study (n=10) in both AR and VR scenarios. Results indicate that Sensible Agent significantly reduces perceived interaction effort compared to a voice-prompted baseline, while maintaining high usability and achieving higher user preference.
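To make the "what/how" decision loop concrete, below is a minimal sketch of how an LMM-driven proactive-assistance decision could be structured. All names (`Context`, `query_lmm`, the prompt wording, and the JSON schema) are hypothetical illustrations, not the paper's actual implementation; the real system runs on-device with richer multimodal inputs.

```python
import json
from dataclasses import dataclass


@dataclass
class Context:
    """Snapshot of multimodal context sensed on the headset (hypothetical fields)."""
    scene_description: str   # e.g. produced by an egocentric-camera captioner
    activity: str            # e.g. "grocery shopping", "in a meeting"
    social_setting: str      # e.g. "alone", "conversation nearby"
    noise_level_db: float    # ambient audio level


DECISION_PROMPT = """You are a proactive AR assistant. Given the user's context,
decide IF assistance should be offered now, WHAT to suggest, and HOW to deliver it
with minimal intrusion (e.g. a small visual chip, a subtle audio cue, or nothing).
Respond as JSON with keys: assist (bool), suggestion (str), modality (str).
Context: {context}"""


def query_lmm(prompt: str) -> str:
    """Placeholder for an on-device LMM call; returns a canned response here."""
    return '{"assist": true, "suggestion": "Show the shopping list", "modality": "visual_chip"}'


def decide_assistance(ctx: Context) -> dict:
    """Ask the LMM when/what/how to assist, given the sensed context."""
    prompt = DECISION_PROMPT.format(context=json.dumps(ctx.__dict__))
    decision = json.loads(query_lmm(prompt))
    # Suppress delivery entirely if the model judges the moment inappropriate.
    if not decision.get("assist", False):
        return {"assist": False}
    return decision


if __name__ == "__main__":
    ctx = Context("aisle with produce shelves", "grocery shopping", "alone", 55.0)
    print(decide_assistance(ctx))
```

The key design point this sketch illustrates is that the model outputs a delivery modality alongside the suggestion itself, so the agent can fall back to a subtler channel, or stay silent, when the sensed context makes an interruption inappropriate.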