🤖 AI Summary
Existing speech emotion recognition (SER) systems predominantly rely on audio and text modalities, overlooking the intrinsic role of physiological mechanisms—such as vocal fold excitation and articulatory dynamics—in emotional expression; moreover, the scarcity of labeled physiological data hinders practical deployment. Method: This work introduces, for the first time, synchronized electroglottographic (EGG) and electromagnetic articulography (EMA) recordings to capture vocal-fold and articulatory dynamics, enabling construction of STEM-E2VA—the first multimodal SER dataset with fine-grained physiological annotations. We further propose a speech-driven physiological feature inversion framework that estimates key physiological dynamics without physical sensors, and design a physiology-audio fusion model for end-to-end emotion classification. Results: Experiments demonstrate significant accuracy improvements when incorporating physiological information; the inversion method exhibits robustness and practical utility in real-world scenarios lacking ground-truth physiological signals, validating the unique representational power of speech production mechanisms in SER.
📝 Abstract
Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, and textual information further enhances its performance. However, few studies have focused on the physiological information involved in speech production, which also conveys speaker traits, including emotional state. To bridge this gap, we conducted a series of experiments to investigate the potential of phonation-excitation information and articulatory kinematics for SER. Because training data for this purpose are scarce, we introduce a dataset of portrayed emotions, STEM-E2VA, which includes audio alongside physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA); EGG and EMA capture phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using physiological data estimated from speech via inversion methods, instead of the collected EGG and EMA signals, to explore the feasibility of applying such physiological information to real-world SER. Experimental results confirm the effectiveness of incorporating speech-production physiology into SER and demonstrate its potential for practical use in real-world scenarios.
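The pipeline the abstract describes, estimating physiological features from speech via inversion and fusing them with audio features for classification, can be sketched in miniature. This is a toy illustration only, not the paper's actual model: every function, feature, and weight here is hypothetical, standing in for learned feature extractors, a trained inversion network, and a trained fusion classifier.

```python
def extract_audio_features(frames):
    # Hypothetical stand-in for real audio features (e.g. MFCCs):
    # here, just utterance-level mean energy and dynamic range.
    return [sum(frames) / len(frames), max(frames) - min(frames)]

def invert_physiology(frames):
    # Hypothetical stand-in for speech-driven inversion of EGG/EMA
    # dynamics: a toy proxy using mean frame-to-frame variation.
    diffs = [abs(b - a) for a, b in zip(frames, frames[1:])]
    return [sum(diffs) / len(diffs)]

def classify(audio_feats, phys_feats, weights, labels):
    # Late fusion: concatenate the two modalities' features and
    # score each emotion class with a linear model.
    fused = audio_feats + phys_feats
    scores = [sum(w * f for w, f in zip(ws, fused)) for ws in weights]
    return labels[scores.index(max(scores))]

frames = [0.1, 0.4, 0.9, 0.3, 0.5]   # toy per-frame energies
labels = ["neutral", "angry"]
weights = [[1.0, 0.0, 0.0],          # hand-set toy weights; a real
           [0.0, 1.0, 2.0]]          # system would learn these
print(classify(extract_audio_features(frames),
               invert_physiology(frames), weights, labels))
```

In a real system, each placeholder would be a trained model, and the fusion step could equally be an attention-based or intermediate-layer fusion rather than this simple concatenation.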