Benchmarking Egocentric Multimodal Goal Inference for Assistive Wearable Agents

📅 2025-10-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of implicitly inferring user goals from multimodal contextual signals (vision, audio, digital sensor data, and long-term memory) to reduce wearable assistive agents' reliance on explicit user interaction. Method: the authors introduce WAGIBench, the first benchmark for goal inference in wearable-agent settings, comprising 29 hours of real-world first-person multimodal data; they build a vision-language model (VLM)-based multimodal goal-inference pipeline and run ablation studies to quantify each modality's contribution. Evaluation uses multiple-choice accuracy and generative relevance scoring. Results: humans achieve 93% accuracy on the multiple-choice task, while the best state-of-the-art (SOTA) model attains only 84%; in the generative evaluation, SOTA models produce semantically relevant goals for just 55% of instances. Core contributions: (1) the first wearable-specific goal-inference benchmark; (2) evidence of a substantial human–machine gap in implicit intention understanding; and (3) a new paradigm and evaluation standard for interaction-free, embodied goal perception.
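
The two evaluation protocols lend themselves to a compact harness. Below is a minimal Python sketch of both metrics; the Example fields and the predict_goal, generate_goal, and is_relevant callables are illustrative assumptions, not WAGIBench's actual API.

```python
# Minimal sketch of the two evaluation protocols described above.
# Example fields and the callables passed in are illustrative assumptions,
# not WAGIBench's actual API.
from dataclasses import dataclass


@dataclass
class Example:
    context: dict        # multimodal observations: vision, audio, digital, longitudinal
    choices: list[str]   # candidate goals for the multiple-choice task
    answer: int          # index of the ground-truth goal


def multiple_choice_accuracy(examples, predict_goal):
    """Fraction of examples where the model selects the ground-truth goal."""
    correct = sum(predict_goal(ex.context, ex.choices) == ex.answer
                  for ex in examples)
    return correct / len(examples)


def generative_relevance(examples, generate_goal, is_relevant):
    """Fraction of free-form generated goals judged semantically relevant."""
    relevant = sum(is_relevant(generate_goal(ex.context), ex.choices[ex.answer])
                   for ex in examples)
    return relevant / len(examples)
```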

📝 Abstract
There has been a surge of interest in assistive wearable agents: agents embodied in wearable form factors (e.g., smart glasses) that take assistive actions toward a user's goal/query (e.g., "Where did I leave my keys?"). In this work, we consider the important complementary problem of inferring that goal from multi-modal contextual observations. Solving this "goal inference" problem holds the promise of eliminating the effort needed to interact with such an agent. This work focuses on creating WAGIBench, a strong benchmark to measure progress in solving this problem using vision-language models (VLMs). Given the limited prior work in this area, we collected a novel dataset comprising 29 hours of multimodal data from 348 participants across 3,477 recordings, featuring ground-truth goals alongside accompanying visual, audio, digital, and longitudinal contextual observations. We validate that human performance exceeds model performance, achieving 93% multiple-choice accuracy compared with 84% for the best-performing VLM. Generative benchmark results that evaluate several families of modern vision-language models show that larger models perform significantly better on the task, yet remain far from practical usefulness, as they produce relevant goals only 55% of the time. Through a modality ablation, we show that models benefit from extra information in relevant modalities with minimal performance degradation from irrelevant modalities.
Problem

Research questions and friction points this paper is trying to address.

Inferring a user's goal implicitly from multimodal contextual observations
Measuring vision-language models' progress on goal inference with a dedicated benchmark
Scarcity of prior work and of suitable multimodal datasets for wearable settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created WAGIBench, the first benchmark for evaluating vision-language models on wearable goal inference
Collected a 29-hour multimodal dataset from 348 participants across 3,477 recordings
Showed via modality ablation that relevant modalities help while irrelevant ones cause minimal degradation (see the sketch below)
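
As a companion to the ablation finding, here is a minimal sketch of a leave-one-modality-out loop. The modality names match the paper's context streams; the evaluate() helper, which is assumed to score a model while masking everything outside keep, is hypothetical.

```python
# Hypothetical leave-one-modality-out ablation loop. The modality names
# match the paper's context streams; evaluate() is an assumed helper that
# scores a model while masking everything outside `keep`.
MODALITIES = ("vision", "audio", "digital", "longitudinal")


def ablate_modalities(examples, evaluate):
    """Compare full-context performance against each single-modality drop."""
    results = {"full": evaluate(examples, keep=MODALITIES)}
    for dropped in MODALITIES:
        keep = tuple(m for m in MODALITIES if m != dropped)
        results[f"without_{dropped}"] = evaluate(examples, keep=keep)
    return results
```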
👥 Authors
Vijay Veerabadran, Meta Reality Labs
Fanyi Xiao, Meta AI (Computer Vision, Machine Learning)
Nitin Kamra, Meta Reality Labs
Pedro Matias, Meta Reality Labs
Joy Chen, Meta FAIR
Caley Drooff, Meta Reality Labs
Brett D Roads, Meta Reality Labs
Riley Williams, Meta Reality Labs
Ethan Henderson, Meta Reality Labs
Xuanyi Zhao, Meta FAIR
Kevin Carlberg, Meta Reality Labs
Joseph Tighe, Meta (Human Understanding, Action Recognition, Detection, Video)
Karl Ridgeway, Facebook (Factorial Representations, Few-shot Learning, Deep Embeddings)