🤖 AI Summary
This work addresses the challenge of implicitly inferring user goals from multimodal contextual signals—vision, audio, digital sensor data, and long-term memory—to reduce wearable assistive agents’ reliance on explicit user interaction.
Method: We introduce WAGIBench, the first benchmark for goal inference in wearable agent settings, comprising 29 hours of real-world first-person multimodal data. We propose a vision-language model (VLM)-based multimodal goal inference framework and conduct ablation studies to quantify each modality’s contribution. Evaluation includes multiple-choice accuracy and generative relevance scoring.
Results: Human performance achieves 93% accuracy on the multiple-choice task, while the state-of-the-art (SOTA) model attains only 84%; in generative evaluation, SOTA models produce semantically relevant goals for just 55% of instances. Our core contributions are: (1) establishing the first wearable-specific goal inference benchmark; (2) revealing a substantial human–machine gap in implicit intention understanding; and (3) proposing a new paradigm and evaluation standard for interaction-free, embodied goal perception.
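The two evaluation modes above can be sketched as simple metrics. This is a minimal illustration with toy data, not the paper's actual scoring pipeline; in particular, how a generated goal is judged "semantically relevant" (e.g., by human raters or an LLM judge) is assumed here to already yield a boolean per instance.

```python
from typing import Iterable, Sequence


def multiple_choice_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of recordings where the chosen option matches the ground-truth goal."""
    assert len(predictions) == len(answers)
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def relevance_rate(judgments: Iterable[bool]) -> float:
    """Fraction of generated goals judged semantically relevant.

    Each judgment is assumed to be a precomputed boolean (hypothetical:
    the benchmark's relevance-judging procedure is not reproduced here).
    """
    judgments = list(judgments)
    return sum(judgments) / len(judgments)


# Toy example: 3 of 4 multiple-choice answers correct, 2 of 4 generations relevant.
print(multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))  # 0.75
print(relevance_rate([True, False, True, False]))  # 0.5
```

A per-modality ablation then amounts to recomputing these metrics on predictions produced with individual input modalities (vision, audio, digital, longitudinal) withheld.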
📝 Abstract
There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multimodal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress on this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results across several families of modern VLMs show that larger models perform significantly better on the task, yet remain far from practical usefulness: they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities, with minimal performance degradation from irrelevant modalities.