🤖 AI Summary
This work addresses the challenge of implicitly inferring user goals from multimodal contextual signals—vision, audio, digital sensor data, and long-term memory—to reduce wearable assistive agents’ reliance on explicit user interaction.
Method: We introduce WAGIBench, the first benchmark for goal inference in wearable agent settings, comprising 29 hours of real-world first-person multimodal data. We propose a vision-language model (VLM)-based multimodal goal inference framework and conduct ablation studies to quantify each modality’s contribution. Evaluation includes multiple-choice accuracy and generative relevance scoring.
Results: Human performance achieves 93% accuracy on the multiple-choice task, while the state-of-the-art (SOTA) model attains only 84%; in generative evaluation, SOTA models produce semantically relevant goals for just 55% of instances. Our core contributions are: (1) establishing the first wearable-specific goal inference benchmark; (2) revealing a substantial human–machine gap in implicit intention understanding; and (3) proposing a new paradigm and evaluation standard for interaction-free, embodied goal perception.
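The two evaluation modes above can be sketched as simple metrics. This is a minimal illustration with toy data, not the paper's actual scoring pipeline; in particular, how a generated goal is judged "semantically relevant" (e.g., by human raters or an LLM judge) is assumed here to already yield a boolean per instance.

```python
from typing import Iterable, Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of recordings where the chosen option matches the ground-truth goal."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def relevance_rate(judgments: Iterable[bool]) -> float:
    """Fraction of generated goals judged semantically relevant.

    Each judgment is assumed to be a precomputed boolean (hypothetical:
    the benchmark's relevance-judging procedure is not reproduced here).
    """
    judgments = list(judgments)
    return sum(judgments) / len(judgments)


# Toy example: 3 of 4 multiple-choice answers correct, 2 of 4 generations relevant.
print(multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
print(relevance_rate([True, False, True, False]))  # 0.5
```

A per-modality ablation then amounts to recomputing these metrics on predictions produced with individual input modalities (vision, audio, digital, longitudinal) withheld.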
📝 Abstract
There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multimodal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress on this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results across several families of modern VLMs show that larger models perform significantly better on the task, yet remain far from practical usefulness: they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities, with minimal performance degradation from irrelevant modalities.