🤖 AI Summary
Existing affective computing research predominantly relies on single-modality data and lacks synchronized, multi-session multimodal benchmark datasets for stress-response analysis.
Method: We introduce EmpathicSchool, a synchronized multimodal stress dataset that concurrently captures facial video and physiological signals such as heart rate, electrodermal activity, and skin temperature, using a high-frame-rate camera and a medical-grade wearable sensor (Empatica E4). Data were collected from 20 participants over 26 hours of ecologically valid stress-inducing scenarios and cover nine signal types spanning both computer-vision and physiological features, with temporal alignment, artifact correction, and multi-source signal-to-noise-ratio validation to ensure fidelity and cross-session consistency.
Contribution/Results: A ResNet+LSTM fusion model trained on this dataset achieves 89.3% accuracy in stress-level classification, significantly outperforming unimodal baselines. The dataset provides a high-quality cross-modal benchmark for stress recognition, addressing a critical gap in affective computing and multimodal behavioral physiology.
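To make the modeling setup concrete, below is a minimal sketch of a video-plus-physiology fusion classifier in the ResNet+LSTM style mentioned above. The module names, tensor shapes, channel count, and three-class stress labels are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical sketch of a video + physiology fusion classifier in PyTorch.
# Channel count, hidden sizes, and the 3-class label space are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class StressFusionNet(nn.Module):
    def __init__(self, n_physio_channels: int = 8, n_classes: int = 3,
                 lstm_hidden: int = 64):
        super().__init__()
        # ResNet backbone encodes each face frame into a 512-d feature vector.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()
        self.frame_encoder = backbone
        # One LSTM aggregates per-frame features over the clip.
        self.video_lstm = nn.LSTM(512, lstm_hidden, batch_first=True)
        # A second LSTM summarizes the synchronized physiological channels.
        self.physio_lstm = nn.LSTM(n_physio_channels, lstm_hidden, batch_first=True)
        # Late fusion: concatenate both summaries and classify stress level.
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, frames: torch.Tensor, physio: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W); physio: (batch, time, channels)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        _, (video_state, _) = self.video_lstm(feats)
        _, (physio_state, _) = self.physio_lstm(physio)
        fused = torch.cat([video_state[-1], physio_state[-1]], dim=-1)
        return self.classifier(fused)


# Example forward pass on dummy data with assumed shapes.
model = StressFusionNet()
logits = model(torch.randn(2, 16, 3, 112, 112), torch.randn(2, 16, 8))
print(logits.shape)  # torch.Size([2, 3])
```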
📝 Abstract
Affective computing has attracted growing research interest in recent years, as AI systems need to better understand and react to human emotions. However, analyzing human emotional states such as mood or stress is complex. While various stress studies use facial expressions and wearables, most existing datasets rely on data from a single modality. This paper presents EmpathicSchool, a novel dataset that captures facial expressions and the associated physiological signals, such as heart rate, electrodermal activity, and skin temperature, under different stress levels. The data were collected from 20 participants across multiple sessions, totaling 26 hours. The dataset comprises nine signal types, spanning both computer-vision and physiological features that can be used to detect stress. In addition, various experiments were conducted to validate the signal quality.
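As an illustration of the kind of preprocessing such a dataset calls for, the sketch below aligns wearable streams recorded at different sampling rates onto a common timeline before pairing them with facial-video features. The sampling rates, column names, and 4 Hz target grid are assumptions for illustration, not the paper's documented pipeline.

```python
# Minimal sketch: align differently sampled wearable streams on one time grid.
# Rates, stream names, and the 4 Hz (250 ms) target grid are assumptions.
import numpy as np
import pandas as pd


def to_series(values: np.ndarray, rate_hz: float, start: pd.Timestamp) -> pd.Series:
    """Attach timestamps to a raw signal recorded at a fixed sampling rate."""
    index = start + pd.to_timedelta(np.arange(len(values)) / rate_hz, unit="s")
    return pd.Series(values, index=index)


start = pd.Timestamp("2022-01-01 10:00:00")
# Dummy E4-style streams: EDA at 4 Hz, skin temperature at 4 Hz, HR at 1 Hz.
eda = to_series(np.random.rand(4 * 60), 4.0, start)
temp = to_series(31 + np.random.rand(4 * 60), 4.0, start)
hr = to_series(70 + np.random.randn(60), 1.0, start)

# Resample every stream to a shared 4 Hz grid and interpolate short gaps,
# producing one row per timestamp for downstream windowing and fusion.
aligned = pd.concat(
    {"eda": eda, "temp": temp, "hr": hr}, axis=1
).resample("250ms").mean().interpolate(limit=4)
print(aligned.head())
```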