Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics

📅 2025-11-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech emotion recognition (SER) systems rely predominantly on audio and text modalities, overlooking the role that physiological mechanisms, such as vocal-fold excitation and articulatory dynamics, play in emotional expression; moreover, the scarcity of labeled physiological data hinders practical deployment. Method: This work introduces, for the first time, synchronized electroglottographic (EGG) and electromagnetic articulography (EMA) recordings to capture vocal-fold and articulatory dynamics, enabling construction of STEM-E2VA, the first multimodal SER dataset with fine-grained physiological annotations. It further proposes a speech-driven physiological feature inversion framework that estimates key physiological dynamics without physical sensors, and a physiology-audio fusion model for end-to-end emotion classification. Results: Experiments show significant accuracy improvements when physiological information is incorporated; the inversion method remains robust and practical in real-world scenarios that lack ground-truth physiological signals, validating the distinctive representational power of speech production mechanisms for SER.

📝 Abstract
Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, and textual information further enhances its performance. However, few studies have focused on the physiological information produced during speech, which also encompasses speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of phonation excitation information and articulatory kinematics for SER. Given the scarcity of training data for this purpose, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA); EGG and EMA provide information on phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using physiological data estimated from speech through inversion methods, instead of the collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production into SER and demonstrate its potential for practical use in real-world scenarios.
Problem

Research questions and friction points this paper is trying to address.

Incorporating physiological speech production data for emotion recognition
Exploring phonation excitation and articulatory kinematics in SER
Validating estimated physiological features for real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using phonation excitation information from electroglottography data
Leveraging articulatory kinematics from electromagnetic articulography data
Estimating physiological features through inversion methods from speech
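The pipeline sketched by the bullets above can be illustrated in miniature: derive acoustic features from the waveform, map them through an inversion step to pseudo-physiological features (standing in for estimated EGG/EMA streams), and fuse both streams for emotion classification. This is a minimal NumPy sketch under stated assumptions; the feature extractors, the linear inversion weights `W`, and the classifier weights `V` are all hypothetical placeholders, not the paper's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)

def acoustic_features(wave, frame=256):
    """Frame-level log-energy as a toy stand-in for real acoustic features."""
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    return np.log(np.sum(frames ** 2, axis=1) + 1e-8)  # shape (n_frames,)

def invert_physiology(acoustic, W):
    """Hypothetical linear inversion: acoustic frames -> pseudo EGG/EMA features.

    The paper learns this mapping from synchronized recordings; here W is
    just a random placeholder to show the data flow.
    """
    return acoustic[:, None] @ W  # shape (n_frames, d_phys)

def fuse_and_classify(acoustic, phys, V):
    """Concatenate utterance-level means of both streams, softmax over emotions."""
    feat = np.concatenate([[acoustic.mean()], phys.mean(axis=0)])
    logits = feat @ V
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

wave = rng.standard_normal(4096)          # toy "speech" signal
ac = acoustic_features(wave)              # 16 frames of log-energy
W = rng.standard_normal((1, 3))           # toy inversion weights (untrained)
V = rng.standard_normal((4, 4))           # toy classifier: 4 emotion classes
probs = fuse_and_classify(ac, invert_physiology(ac, W), V)
```

The point of the sketch is the data flow, not the models: at inference time no EGG or EMA sensor is attached, so the physiological stream exists only as an estimate inverted from the speech signal before fusion.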
Ziqian Zhang
Nanjing University
Reinforcement Learning · Multi-Agent Reinforcement Learning
Min Huang
School of Optoelectronic Science and Engineering, Soochow University, Suzhou, China
Zhongzhe Xiao
School of Optoelectronic Science and Engineering, Soochow University, Suzhou, China