"How to Explore Biases in Speech Emotion AI with Users?" A Speech-Emotion-Acting Study Exploring Age and Language Biases

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the underexamined dual biases of age (teenagers vs. adults aged 55+) and language (Danish vs. English) in speech emotion recognition (SER). It proposes a goal-directed behavioral framework that uses a visual emotion-guidance interface to elicit deliberately performed emotional speech, places the emotion targets in valence-arousal space, and logs the real-time gap between human intent and machine interpretation to expose the risk of affective misalignment. A custom system provides real-time prediction and logging across both languages and both age groups. The results suggest that the model is robust across age and language, with no significant differences between groups, but they also show limitations in recognizing high-arousal emotions. The key contribution is bringing intent alignment into SER evaluation, shifting the focus from accuracy-centric metrics toward more inclusive, human-centered models.

📝 Abstract
This study explores how age and language shape the deliberate vocal expression of emotion, addressing two underexplored user groups, teenagers (N = 12) and adults 55+ (N = 12), within speech emotion recognition (SER). While most SER systems are trained on spontaneous, monolingual English data, our research evaluates how such models interpret intentionally performed emotional speech across age groups and languages (Danish and English). To support this, we developed a novel experimental paradigm combining a custom user interface with a backend for real-time SER prediction and data logging. Participants were prompted to hit visual targets in valence-arousal space by deliberately expressing four emotion targets. While limitations include some reliance on self-managed voice recordings and inconsistent task execution, the results suggest, contrary to expectations, no significant differences between language or age groups and a degree of cross-linguistic and age robustness in model interpretation, although some limitations in high-arousal emotion recognition were evident. Our qualitative findings highlight the need to move beyond system-centered accuracy metrics and embrace more inclusive, human-centered SER models. By framing emotional expression as a goal-directed act and logging the real-time gap between human intent and machine interpretation, we expose the risks of affective misalignment.
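
The gap between human intent and machine interpretation described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the emotion labels, the target coordinates, the [-1, 1] axis range, the `Trial` fields, and the use of Euclidean distance are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical targets in valence-arousal space (valence, arousal), each axis in [-1, 1].
# The abstract mentions four emotion targets but does not name or place them.
TARGETS: Dict[str, Tuple[float, float]] = {
    "happy": (0.8, 0.6),
    "angry": (-0.7, 0.7),
    "sad": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}


@dataclass
class Trial:
    """One prompted utterance and the model's prediction for it (assumed fields)."""
    participant: str
    age_group: str                   # e.g. "teen" or "55+"
    language: str                    # e.g. "da" or "en"
    target: str                      # key into TARGETS
    predicted: Tuple[float, float]   # (valence, arousal) returned by an SER model


def misalignment(trial: Trial) -> float:
    """Euclidean distance between the intended target and the predicted point."""
    target_v, target_a = TARGETS[trial.target]
    pred_v, pred_a = trial.predicted
    return math.hypot(pred_v - target_v, pred_a - target_a)


def log_row(trial: Trial) -> dict:
    """Flatten a trial into a record suitable for appending to a log."""
    return {
        "participant": trial.participant,
        "age_group": trial.age_group,
        "language": trial.language,
        "target": trial.target,
        "pred_valence": trial.predicted[0],
        "pred_arousal": trial.predicted[1],
        "misalignment": misalignment(trial),
    }


if __name__ == "__main__":
    trial = Trial("p01", "teen", "da", "happy", (0.8, 0.1))
    print(log_row(trial))
```

Records of this shape could then be grouped by age group, language, or target to compare misalignment between conditions, for example to check whether high-arousal targets are missed more often.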
Problem

Research questions and friction points this paper is trying to address.

Explores age and language biases in speech emotion AI models
Evaluates SER model performance across age groups and languages
Highlights need for human-centered SER models beyond accuracy metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel experimental paradigm with custom UI
Real-time SER prediction and data logging
Goal-directed emotional expression framework
Josephine Beatrice Skovbo Borre
Aalborg University, Denmark
Malene Gorm Wold
Aalborg University, Denmark
Sara Kjær Rasmussen
Aalborg University, Denmark
Ilhan Aslan
Associate Professor, Aalborg University
Intelligent User Interfaces, Human-Computer Interaction, Human-Centered AI