EgoIntrospect: An Egocentric Dataset and Benchmark for User-Centric Internal State Reasoning

📅 2026-05-17
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing egocentric video datasets rarely capture users' internal states, such as intent, emotion, and memory, which limits how naturally AI assistants can interact with people. To address this gap, this work introduces EgoIntrospect, described as the first egocentric dataset captured in user-driven scenarios with self-annotations that reveal users' subjective states and interactive intentions toward AI assistants. The data were collected with a cross-device setup that synchronizes video, audio, eye-tracking, motion, and physiological signals, and comprise 180 hours of recordings from 60 participants. Building on the dataset, the authors define a benchmark that evaluates how well multimodal large language models infer internal states from egocentric observations. Experiments suggest that current models struggle to fuse multimodal signals effectively when inferring these states.
📝 Abstract
Despite extensive efforts on egocentric video datasets and benchmarks, understanding users' internal states, which is crucial for enabling seamless AI assistant experiences, remains largely overlooked. In this work, we introduce EgoIntrospect, the first egocentric dataset captured in user-driven scenarios with self-annotations that explicitly reveal users' interactive intentions with AI assistants. EgoIntrospect was collected using a cross-device setup, providing synchronized video, audio, gaze, motion, and physiological signals. It consists of 180 hours of recordings from 60 subjects, with an average recording duration of 3 hours per subject. Leveraging EgoIntrospect, we formalize a suite of tasks centered on user internal states, including affective experience, interactive intent, and cognitive memory. We further process the annotations to construct benchmarks that evaluate the ability of modern multimodal large language models to reason about users' internal states from egocentric observations. Experiments on our benchmark suggest that existing multimodal large language models struggle to effectively leverage multimodal signals to infer users' subjective internal states. The dataset and annotations will be made publicly available to advance research in egocentric vision and wearable AI assistants. Project page: https://ego-introspect.github.io/
Problem

Research questions and friction points this paper is trying to address.

egocentric vision
internal state reasoning
user-centric AI
multimodal understanding
interactive intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric vision
internal state reasoning
multimodal sensing
user intent annotation
wearable AI assistants