Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions

📅 2026-04-17

🤖 AI Summary
This work addresses the sharp performance drop that existing movie emotion understanding models suffer in first-person egocentric viewing scenarios, where domain shifts such as perspective distortion and lighting variation undermine robustness. To tackle this, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for emotion understanding from a first-person screen-watching perspective, annotated with a confidence-aware multi-label protocol to handle emotional ambiguity. We further propose a multimodal long-context emotion reasoning framework that integrates temporal visual features, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments show that models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on egocentric screen-view observations. With multimodal alignment and domain-adaptive training on ESE, robustness under realistic egocentric viewing conditions improves substantially, and our approach performs competitively with strong closed-source multimodal models.


📝 Abstract
Embodied robotic agents often perceive movies through an egocentric screen-view interface rather than native cinematic footage, introducing domain shifts such as viewpoint distortion, scale variation, illumination changes, and environmental interference. However, existing research on movie emotion understanding is almost exclusively conducted on cinematic footage, limiting cross-domain generalization to real-world viewing scenarios. To bridge this gap, we introduce EgoScreen-Emotion (ESE), the first benchmark dataset for egocentric screen-view movie emotion understanding. ESE contains 224 movie trailers captured under controlled egocentric screen-view conditions, producing 28,667 temporally aligned key-frames annotated by multiple raters with a confidence-aware multi-label protocol to address emotional ambiguity. We further build a multimodal long-context emotion reasoning framework that models temporal visual evidence, narrative summaries, compressed historical context, and audio cues. Cross-domain experiments reveal a severe domain gap: models trained on cinematic footage drop from 27.99 to 16.69 Macro-F1 when evaluated on realistic egocentric screen-view observations. Training on ESE substantially improves robustness under realistic viewing conditions. Our approach achieves competitive performance compared with strong closed-source multimodal models, highlighting the importance of domain-specific data and long-context multimodal reasoning.
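
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how Macro-F1 is commonly computed for multi-label predictions: F1 is calculated per label and then averaged without weighting. It is not the authors' evaluation code. The function name `macro_f1`, the three-label toy data, and the convention of scoring a label with no true or predicted positives as 0.0 are all assumptions made for illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary indicator rows.

    Each row is one sample; each column is one emotion label (1 = present).
    Assumption: a label with no true and no predicted positives scores 0.0.
    """
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels


# Hypothetical toy data: 4 key-frames, 3 illustrative emotion labels.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
score = macro_f1(y_true, y_pred)
```

Because every label contributes equally to the average, a label the model never predicts (the third column here) pulls the score down as much as a frequent one, which is why Macro-F1 is a common choice when emotion labels are imbalanced.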
Problem

Research questions and friction points this paper is trying to address.

egocentric vision
emotion understanding
domain shift
embodied agents
movie trailers
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric vision
emotion understanding
multimodal reasoning
domain shift
long-context modeling