🤖 AI Summary
To address the limited generalizability and scalability of monolingual models in multilingual speech emotion recognition (SER), this paper proposes a language-aware multi-teacher knowledge distillation framework. Methodologically, monolingual teacher models built on Wav2Vec 2.0 are trained separately on English, Finnish, and French data; a language-identifier embedding and a hierarchical attention mechanism explicitly model language-specific characteristics, guiding the student model toward language-adaptive emotional representations. The framework thus provides a structured approach to cross-lingual knowledge transfer in SER. Experiments report a weighted recall of 72.9% on the English test set and an unweighted recall of 63.4% on the Finnish test set, substantially outperforming fine-tuning and conventional distillation baselines. Improvements are most pronounced for the sadness and neutral categories. These results support the efficacy and generalization advantage of language-aware distillation in multilingual SER.
📝 Abstract
Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending these models to build a multilingual system remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To this end, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec 2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9% on the English dataset and an unweighted recall of 63.4% on the Finnish dataset, surpassing fine-tuning and knowledge-distillation baselines. Our method excels at improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.
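The multi-teacher distillation step described above can be sketched as a weighted sum of per-teacher soft-label losses. This is a minimal illustrative sketch, not the paper's exact formulation: the `lang_weights` dictionary stands in for the language-identifier/attention weighting, and the temperature value is an assumption.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def language_aware_kd_loss(student_logits, teacher_logits_by_lang,
                           lang_weights, T=2.0):
    """Weighted sum of KL(teacher || student) soft-label losses, one per
    monolingual teacher. `lang_weights` is a hypothetical stand-in for the
    language-aware weighting described in the paper; T is a distillation
    temperature (the T**2 factor is the standard gradient-scale correction)."""
    s = softmax(student_logits, T)
    loss = 0.0
    for lang, t_logits in teacher_logits_by_lang.items():
        t = softmax(t_logits, T)
        kl = sum(ti * (math.log(ti) - math.log(si))
                 for ti, si in zip(t, s))
        loss += lang_weights[lang] * kl * (T ** 2)
    return loss
```

For example, a student whose logits already match a teacher's incurs zero loss from that teacher, while the language weights let the French teacher contribute more strongly on French utterances.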