🤖 AI Summary
Continuous affect recognition in real-world scenarios faces two key challenges: (1) the scarcity of large-scale, labelled video datasets with extensive coverage of the 2D affective space (valence and arousal), and (2) the difficulty of extracting facial video features that are simultaneously discriminative, interpretable, robust, and computationally efficient. To address these, we propose xTrace, a lightweight, real-time system built on (i) a training set of ~450K diverse video segments that covers most zones of the 2D affective space; and (ii) a novel affect descriptor integrating facial action modeling, dimensional affect regression, and uncertainty-aware estimation. On an in-the-wild validation set of 50K videos, xTrace achieves a mean Concordance Correlation Coefficient (CCC) of 0.86 and a mean absolute error of 0.13. Its key components are benchmarked against MediaPipe, OpenFace, and the Augsburg Affect Toolbox. Error analysis shows high accuracy across most bins of the 2D affective space, robustness to non-frontal head poses, and uncertainty estimates that correlate strongly with accuracy.
📝 Abstract
Recognising expressive behaviours in face videos is a long-standing challenge in Affective Computing. Despite significant recent advances, building a robust and reliable real-time system for naturalistic, in-the-wild facial expressive behaviour analysis remains difficult. This paper addresses two key challenges in building such a system: (1) the paucity of large-scale labelled facial affect video datasets with extensive coverage of the 2D emotion space, and (2) the difficulty of extracting facial video features that are discriminative, interpretable, robust, and computationally efficient. To address these challenges, we introduce xTrace, a robust tool for analysing facial expressive behaviour and predicting continuous values of dimensional emotions, namely valence and arousal, from in-the-wild face videos. To address challenge (1), our affect recognition model is trained on the largest facial affect video dataset, containing ~450k videos that cover most emotion zones in the dimensional emotion space, making xTrace highly versatile in analysing a wide spectrum of naturalistic expressive behaviours. To address challenge (2), xTrace uses facial affect descriptors that are not only explainable but also achieve a high degree of accuracy and robustness with low computational complexity. The key components of xTrace are benchmarked against three existing tools: MediaPipe, OpenFace, and Augsburg Affect Toolbox. On an in-the-wild validation set of 50k videos, xTrace achieves a mean concordance correlation coefficient (CCC) of 0.86 and a mean absolute error of 0.13. We present a detailed error analysis of affect predictions from xTrace, illustrating (a) its ability to recognise emotions with high accuracy across most bins in the 2D emotion space, (b) its robustness to non-frontal head pose angles, and (c) a strong correlation between its uncertainty estimates and its accuracy.
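For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how CCC and mean absolute error are commonly computed for valence and arousal predictions. It is not taken from xTrace: the function names are hypothetical, the sample values are made up, and averaging the metric over the two dimensions is only an assumption about how the "mean" figures are aggregated.

```python
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population moments."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    var_t = fmean([(t - mu_t) ** 2 for t in y_true])
    var_p = fmean([(p - mu_p) ** 2 for p in y_pred])
    cov = fmean([(t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)])
    # Unlike Pearson correlation, the denominator also penalises a shift
    # between the two means, so a constant bias lowers the score.
    denom = var_t + var_p + (mu_t - mu_p) ** 2
    if denom == 0.0:
        raise ValueError("CCC is undefined when both series are the same constant")
    return 2.0 * cov / denom


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def mean_over_dimensions(metric, labels, preds):
    """Average a metric over affect dimensions (assumed aggregation scheme)."""
    return fmean([metric(labels[dim], preds[dim]) for dim in labels])


# Toy values for illustration only; these are not results from the paper.
labels = {"valence": [0.2, -0.4, 0.7, 0.0], "arousal": [0.5, 0.1, -0.2, 0.3]}
preds = {"valence": [0.3, -0.3, 0.6, 0.1], "arousal": [0.4, 0.2, -0.1, 0.2]}

print("mean CCC:", mean_over_dimensions(concordance_cc, labels, preds))
print("mean MAE:", mean_over_dimensions(mean_absolute_error, labels, preds))
```

CCC is often preferred over plain correlation for dimensional affect because it rewards agreement in scale and offset as well as in trend; a predictor that tracks the labels perfectly but with a constant bias scores below 1.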