🤖 AI Summary
Cross-lingual speech emotion recognition (SER) faces dual challenges: acoustic variation across languages and speaker-specific stylistic differences. To address these, we propose a phoneme-speaker dual-space alignment framework with explicit style awareness. First, graph-based clustering groups speakers into emotion-specific communities that capture shared expressive traits. Second, phonemes serve as anchors to jointly align emotional representations in the phoneme and speaker embedding spaces. Finally, dual-space embedding mapping, coupled with cross-lingual transfer learning, enhances generalization across languages and speakers. Evaluated on MSP-Podcast and BIIC-Podcast, two challenging cross-lingual SER benchmarks, our method significantly outperforms state-of-the-art baselines. The results demonstrate its effectiveness in modeling emotional expression patterns shared across languages and speakers, validating its capacity to capture language-invariant, speaker-robust emotional cues. This work provides a structured, transferable solution for SER in low-resource languages, advancing robust cross-lingual affective computing.
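The summary's first step, grouping speakers into emotion-specific communities via a similarity graph, can be sketched as follows. This is an illustrative toy only, not the paper's implementation: the embeddings, the similarity threshold, and the use of connected components as a stand-in for community detection are all assumptions.

```python
# Hypothetical sketch: emotion-specific speaker communities from a
# similarity graph. Embeddings, threshold, and connected-component
# grouping are illustrative assumptions, not the paper's method.
import math
from collections import defaultdict, deque

# Toy per-speaker embeddings for one emotion (e.g., mean "happy" vectors).
speaker_emb = {
    "spkA": [0.9, 0.1, 0.0], "spkB": [0.8, 0.2, 0.1],
    "spkC": [0.1, 0.9, 0.2], "spkD": [0.0, 1.0, 0.1],
    "spkE": [0.5, 0.5, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Add an edge between speakers whose expressive styles are similar enough.
THRESHOLD = 0.8  # hypothetical cutoff
adj = defaultdict(set)
names = list(speaker_emb)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        if cosine(speaker_emb[u], speaker_emb[v]) > THRESHOLD:
            adj[u].add(v)
            adj[v].add(u)

def communities(nodes, adj):
    """Communities = connected components of the similarity graph."""
    seen, groups = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = set(), deque([n])
        while queue:
            cur = queue.popleft()
            if cur in comp:
                continue
            comp.add(cur)
            queue.extend(adj[cur] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

print(communities(names, adj))  # → [['spkA', 'spkB'], ['spkC', 'spkD'], ['spkE']]
```

A real system would replace the toy vectors with learned speaker-emotion embeddings and a proper community-detection algorithm (e.g., modularity-based), but the grouping logic is the same.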
📝 Abstract
Cross-lingual speech emotion recognition (SER) remains a challenging task due to phonetic variation and speaker-specific expressive styles that differ across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align how emotions are externalized by different speakers and in different languages. To address this problem, we propose a speaker-style-aware phoneme anchoring framework that aligns emotional expression at both the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in the speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities of cross-lingual emotion representation.
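The phoneme-anchoring idea in the abstract can be made concrete with a minimal sketch: emotional embeddings that share a phoneme, regardless of language, are pulled toward a common anchor. Everything here is a hedged assumption for illustration: the toy embeddings, the mean-pooled anchor, and the squared-distance loss are not taken from the paper.

```python
# Hypothetical sketch of phoneme anchoring: pull same-phoneme emotional
# embeddings from both languages toward a shared anchor. All names,
# shapes, and the squared-distance loss are illustrative assumptions.

# Toy emotional embeddings keyed by phoneme, per language.
english = {"ah": [[0.9, 0.1], [1.1, -0.1]], "iy": [[0.0, 1.0]]}
mandarin = {"ah": [[0.8, 0.2]], "iy": [[0.2, 0.9], [-0.1, 1.1]]}

def anchor(vectors):
    """Shared anchor = mean of all embeddings for one phoneme."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def anchoring_loss(lang_a, lang_b):
    """Mean squared distance of every embedding to its phoneme anchor."""
    total, count = 0.0, 0
    for ph in lang_a.keys() & lang_b.keys():
        pooled = lang_a[ph] + lang_b[ph]
        c = anchor(pooled)
        for v in pooled:
            total += sum((x - y) ** 2 for x, y in zip(v, c))
            count += 1
    return total / count

loss = anchoring_loss(english, mandarin)
print(round(loss, 4))  # → 0.0267
```

Minimizing a loss of this shape would encourage language-invariant emotional representations at the phoneme level; the paper additionally anchors in the speaker space via the emotion-specific communities.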