🤖 AI Summary
Existing emotion recognition methods suffer from key limitations: overt behavioral cues (e.g., facial expressions, speech) are easily feigned; physiological signals require invasive instrumentation; and gaze analysis often neglects environmental context. To address these, we propose a non-intrusive, continuous emotion recognition paradigm leveraging only a standard high-definition camera to simultaneously capture naturalistic gaze trajectories and head motion. For the first time, our approach deeply integrates gaze dynamics, environmental semantics, and temporal evolution into a unified spatial–semantic–temporal behavioral model. It operates implicitly—requiring no user cooperation or specialized sensors—to decode affective states. Experimental results demonstrate high robustness, real-time performance, low cost, and strong scalability in unconstrained real-world settings. This work advances computational affective science by formalizing emotion as an emergent product of human–environment interaction, offering a novel, scalable, and ecologically valid framework for implicit affective computing.
📝 Abstract
Emotion recognition, as a step toward mind reading, seeks to infer internal states from external cues. Most existing methods rely on explicit signals, such as facial expressions, speech, or gestures, that reflect only bodily responses and overlook the influence of environmental context. These cues are often voluntary, easy to mask, and insufficient for capturing deeper, implicit emotions. Physiological signal-based approaches offer more direct access to internal states but require complex sensors that compromise natural behavior and limit scalability. Gaze-based methods typically rely on static fixation analysis and fail to capture the rich, dynamic interactions between gaze and the environment, and thus cannot uncover the deep connection between emotion and implicit behavior. To address these limitations, we propose a novel camera-based, user-unaware emotion recognition approach that integrates gaze fixation patterns with environmental semantics and temporal dynamics. Leveraging standard HD cameras, our method unobtrusively captures users' eye appearance and head movements in natural settings, without the need for specialized hardware or active user participation. From these visual cues, the system estimates gaze trajectories over time and space, providing the basis for modeling the spatial, semantic, and temporal dimensions of gaze behavior. This allows us to capture the dynamic interplay between visual attention and the surrounding environment, revealing that emotions are not merely physiological responses but complex outcomes of human-environment interactions. The proposed approach enables user-unaware, real-time, and continuous emotion recognition, offering high generalizability and low deployment cost.
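To make the spatial-semantic-temporal modeling concrete, the sketch below shows one hypothetical way such features could be extracted from an estimated gaze trajectory. This is an illustrative assumption, not the authors' implementation: the sample format `(x, y, t, region_label)`, the choice of mean saccade amplitude as the spatial feature, and the use of scene-region labels as the semantic channel are all invented here for demonstration.

```python
from collections import Counter
from math import hypot

def gaze_features(trajectory):
    """Hypothetical feature extractor (not the paper's method).

    Each sample is (x, y, t_seconds, region_label), where region_label
    is an assumed semantic class of the fixated scene region.
    """
    xs = [p[0] for p in trajectory]
    ys = [p[1] for p in trajectory]
    ts = [p[2] for p in trajectory]
    labels = [p[3] for p in trajectory]

    # Spatial: mean saccade amplitude between consecutive gaze samples.
    amps = [hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))]
    mean_amp = sum(amps) / len(amps) if amps else 0.0

    # Semantic: distribution of attention over scene-region classes.
    counts = Counter(labels)
    semantic = {k: v / len(labels) for k, v in counts.items()}

    # Temporal: total span covered by the trajectory.
    span = ts[-1] - ts[0] if len(ts) > 1 else 0.0

    return {"mean_saccade_amp": mean_amp,
            "semantic_dist": semantic,
            "duration_s": span}

# Toy trajectory: two fixations, one on a face and one on a screen.
traj = [(0.1, 0.2, 0.00, "face"),
        (0.1, 0.2, 0.05, "face"),
        (0.6, 0.7, 0.10, "screen"),
        (0.6, 0.7, 0.15, "screen")]
feats = gaze_features(traj)
```

In a full pipeline, a feature vector like this (or a learned equivalent) would feed a downstream classifier that maps gaze-environment dynamics to affective states.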