Ego-Grounding for Personalized Question-Answering in Egocentric Videos

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that models struggle to understand and reason about the identity and past experiences of camera wearers in egocentric videos. To this end, it introduces the novel task of “ego-grounding” and presents MyEgo, the first personalized question-answering dataset, comprising 541 long-form egocentric videos and 5,000 questions. The study systematically evaluates state-of-the-art multimodal large language models (MLLMs) on tasks involving “my objects,” “my activities,” and “my past.” Results reveal that even leading models such as GPT-5 and Qwen3-VL achieve only 46% and 36% accuracy, respectively, substantially below human performance. While explicitly providing supporting evidence improves model accuracy, this gain diminishes rapidly as the temporal distance to the evidence grows, highlighting a critical limitation in current MLLMs’ capacity for long-term self-referential memory and reasoning.
📝 Abstract
We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding: the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants (open-source vs. proprietary, thinking vs. non-thinking, small vs. large scales) all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40 and 50 percentage points, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
Problem

Research questions and friction points this paper is trying to address.

ego-grounding
personalized question-answering
egocentric videos
VideoQA
long-range memory
Innovation

Methods, ideas, or system contributions that make the work stand out.

ego-grounding
egocentric video
personalized question answering
multimodal large language models
long-range memory