🤖 AI Summary
Current AR personal assistants suffer from opacity, a lack of traceability, and poor adaptability across scenarios. To address these challenges, this paper proposes the first end-to-end transparent, interpretable, and multimodal AR task-guidance system. Methodologically, it integrates computer vision, multimodal perception, attention modeling, and real-time semantic reasoning to achieve full-chain interpretability and data traceability across perception, reasoning, and interaction in AR. A unified data-flow architecture and a visual debugging interface enable rapid domain-specific customization. Experiments on multiple real-world tasks demonstrate significant improvements in operational accuracy and user trust; fault-detection latency stays under 200 ms, and debugging efficiency increases by 60%. The core contribution is a systematic paradigm for realizing transparency and interpretability in AR agents, bridging theoretical principles with deployable, auditable, and maintainable AR intelligence.
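The summary emphasizes full-chain data traceability and a visual debugging interface without describing a concrete mechanism. As a rough, non-authoritative illustration of what stage-by-stage traceability can look like, the sketch below logs each pipeline stage's output as a timestamped, replayable record; all class, field, and file names are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch (not the authors' code): one way a "full-chain" trace
# could be recorded so that perception, reasoning, and interaction outputs
# remain auditable for post-hoc debugging. All names are illustrative.
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any


@dataclass
class TraceEvent:
    stage: str                 # e.g. "perception", "reasoning", "interaction"
    payload: dict[str, Any]    # the stage's output (detections, step state, ...)
    ts: float = field(default_factory=time.time)


class TraceLog:
    """Append-only log that a visual debugging UI could replay or query."""

    def __init__(self, path: str):
        self.path = path

    def record(self, event: TraceEvent) -> None:
        # One JSON object per line keeps the log streamable and easy to inspect.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(asdict(event)) + "\n")


# Usage: each component appends its outputs as they are produced.
log = TraceLog("session_trace.jsonl")
log.record(TraceEvent("perception", {"objects": ["bowl", "knife"], "frame": 1842}))
log.record(TraceEvent("reasoning", {"current_step": 3, "confidence": 0.87}))
```

An append-only, per-stage record like this is one plausible way to support the fast fault detection and post-hoc analysis the summary claims, since a debugging view can filter or replay events by stage and timestamp.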
📝 Abstract
The concept of an AI assistant for task guidance is rapidly shifting from a science fiction staple to an impending reality. Such a system is inherently complex, requiring models for perceptual grounding, attention, and reasoning, an intuitive interface that adapts to the performer's needs, and the orchestration of data streams from many sensors. Moreover, all data acquired by the system must be readily available for post-hoc analysis to enable developers to understand performer behavior and quickly detect failures. We introduce TIM, the first end-to-end AI-enabled task guidance system in augmented reality, which is capable of detecting both the user and the scene as well as providing adaptable, just-in-time feedback. We discuss the system challenges and propose design solutions. We also demonstrate how TIM adapts to domain applications with varying needs, highlighting how the system components can be customized for each scenario.
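To make the architecture the abstract describes slightly more concrete, the following sketch shows how perceptual grounding, reasoning, adaptive feedback, and data recording for post-hoc analysis might be wired together. The component interfaces are assumptions for illustration only; the paper does not publish this API, and none of these names come from the TIM codebase.

```python
# Hypothetical sketch of the kind of orchestration the abstract describes:
# multiple sensor streams feed perception (scene + user), a reasoning step
# decides what guidance is needed, and the AR interface renders just-in-time
# feedback. Every name here is illustrative.
from typing import Iterable, Optional, Protocol


class SensorStream(Protocol):
    def read(self) -> dict: ...            # latest frame, audio chunk, or pose sample


class Perception(Protocol):
    def ground(self, samples: list[dict]) -> dict: ...   # objects, hands, attention


class Reasoner(Protocol):
    def next_instruction(self, state: dict) -> Optional[str]: ...


class ARInterface(Protocol):
    def show(self, message: str) -> None: ...


def guidance_loop(streams: Iterable[SensorStream],
                  perception: Perception,
                  reasoner: Reasoner,
                  ui: ARInterface,
                  recorder) -> None:
    """Perceive -> reason -> give feedback, with every step recorded for analysis."""
    while True:
        samples = [s.read() for s in streams]           # orchestrate all sensor streams
        state = perception.ground(samples)              # user + scene understanding
        recorder.record("perception", state)            # keep data for post-hoc analysis
        instruction = reasoner.next_instruction(state)  # task-step reasoning
        if instruction is not None:
            ui.show(instruction)                        # adaptable, just-in-time feedback
            recorder.record("interaction", {"shown": instruction})
```

Separating the loop from the component interfaces is one way a system like this could be customized per scenario: a new domain would swap in different perception models or reasoning logic without changing the orchestration or the recording path.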