Cross-lingual Speech Emotion Recognition: Humans vs. Self-Supervised Models

📅 2024-09-25
🏛️ IEEE International Conference on Acoustics, Speech, and Signal Processing
📈 Citations: 1
Influential: 0
🤖 AI Summary
This study investigates whether self-supervised speech models (e.g., Wav2Vec 2.0, HuBERT) can match human performance in cross-lingual speech emotion recognition (SER), systematically comparing humans and models across monolingual, cross-lingual, and transfer-learning settings at both the utterance and segment levels. Methodologically, it combines layer-wise analysis of SSL representations, parameter-efficient fine-tuning strategies, cross-lingual transfer learning, and human evaluation of dialectal speech. Key contributions: (1) showing that dialect substantially degrades SER for listeners who lack the relevant linguistic and paralinguistic background; (2) demonstrating that, with appropriate knowledge transfer, adapted SSL models can reach performance on the target language comparable to native speakers; and (3) identifying emotion-specific behavioral differences between humans and models, yielding both cognitive and engineering insights for cross-lingual SER modeling.

📝 Abstract
Utilizing Self-Supervised Learning (SSL) models for Speech Emotion Recognition (SER) has proven effective, yet limited research has explored cross-lingual scenarios. This study presents a comparative analysis between human performance and SSL models, beginning with a layer-wise analysis and an exploration of parameter-efficient fine-tuning strategies in monolingual, cross-lingual, and transfer learning contexts. We further compare the SER ability of models and humans at both utterance- and segment-levels. Additionally, we investigate the impact of dialect on cross-lingual SER through human evaluation. Our findings reveal that models, with appropriate knowledge transfer, can adapt to the target language and achieve performance comparable to native speakers. We also demonstrate the significant effect of dialect on SER for individuals without prior linguistic and paralinguistic background. Moreover, both humans and models exhibit distinct behaviors across different emotions. These results offer new insights into the cross-lingual SER capabilities of SSL models, underscoring both their similarities to and differences from human emotion perception.
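The layer-wise analysis mentioned in the abstract is commonly implemented as a learnable weighted sum over the hidden states of each SSL transformer layer, where the learned weights reveal which layers carry the most emotion-relevant information. A minimal dependency-free sketch of that pooling step (not the paper's actual implementation; function names and the toy features are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def layerwise_pool(layer_feats, layer_logits):
    """Combine one feature vector per SSL layer into a single vector
    using softmax-normalised, learnable per-layer weights."""
    weights = softmax(layer_logits)
    dim = len(layer_feats[0])
    pooled = [0.0] * dim
    for w, feat in zip(weights, layer_feats):
        for i, v in enumerate(feat):
            pooled[i] += w * v
    return pooled, weights

# Toy example: 3 layers, 2-dim features, untrained (equal) weights.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [0.0, 0.0, 0.0]
pooled, weights = layerwise_pool(feats, logits)
```

In practice the `layer_logits` are trained jointly with a small classifier head while the SSL backbone stays frozen; inspecting the resulting `weights` per layer is what the layer-wise analysis reports.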
Problem

Research questions and friction points this paper is trying to address.

Comparing human and SSL model performance in cross-lingual speech emotion recognition
Investigating dialect impact on cross-lingual emotion recognition accuracy
Analyzing utterance- and segment-level emotion detection differences between humans and models
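Comparing utterance- and segment-level behavior requires some rule for aggregating per-segment emotion predictions into an utterance-level decision. The paper does not specify its aggregation rule here; a simple majority-vote sketch (names and tie-breaking policy are assumptions for illustration):

```python
from collections import Counter

def utterance_label(segment_preds):
    """Aggregate segment-level emotion predictions into one utterance label
    by majority vote; ties go to the earliest-occurring segment's label."""
    counts = Counter(segment_preds)
    best = max(counts.values())
    for pred in segment_preds:  # first segment wins ties
        if counts[pred] == best:
            return pred

label = utterance_label(["neutral", "angry", "angry", "sad"])  # → "angry"
```

Divergences between humans and models then show up as cases where segment-level agreement is high but the aggregated utterance-level decisions differ, or vice versa.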
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes Self-Supervised Learning for cross-lingual SER
Compares human and model performance at multiple levels
Investigates dialect impact on emotion recognition
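Parameter-efficient fine-tuning keeps the pretrained SSL weights frozen and trains only a small number of added parameters. One widely used technique is LoRA, which adds a low-rank update A·B to a frozen weight matrix W. A minimal forward-pass sketch under that assumption (plain lists instead of tensors; not the paper's implementation):

```python
def matvec(W, x):
    """y_j = sum_i x_i * W[i][j] for a row-major matrix W."""
    cols = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(cols)]

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen weight W (d x k) plus trainable low-rank update A (d x r)
    times B (r x k), scaled by alpha / r. Only A and B are trained."""
    r = len(B)
    base = matvec(W, x)                 # frozen path
    delta = matvec(B, matvec(A, x))     # low-rank trainable path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: d = k = 2, rank r = 1, identity W.
y = lora_forward([1.0, 2.0],
                 W=[[1.0, 0.0], [0.0, 1.0]],
                 A=[[1.0], [1.0]],
                 B=[[1.0, 0.0]])        # → [4.0, 2.0]
```

Because only A and B (d·r + r·k parameters per layer) are updated, adapting a large SSL model to a new target language stays cheap, which is what makes the cross-lingual transfer setting practical.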