Speaker Style-Aware Phoneme Anchoring for Improved Cross-Lingual Speech Emotion Recognition

📅 2025-09-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-lingual speech emotion recognition (SER) faces two challenges: cross-lingual acoustic variation and speaker-specific stylistic differences. To address both, we propose a phoneme-speaker dual-space alignment framework with explicit style awareness. First, a graph neural network constructs emotion-specific speaker communities to cluster speakers. Second, phonemes serve as anchors to jointly align emotional representations in both the phoneme and speaker embedding spaces. Finally, dual-space embedding mapping coupled with cross-lingual transfer learning enhances generalization across languages and speakers. Evaluated on MSP-Podcast and BIIC-Podcast, two challenging cross-lingual SER benchmarks, our method significantly outperforms state-of-the-art baselines. The results demonstrate its effectiveness in modeling shared emotional expression patterns across languages and speakers, validating its capacity to capture language-invariant and speaker-robust emotional cues. This work provides a structured, transferable solution for SER in low-resource languages, advancing robust cross-lingual affective computing.
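The summary does not specify the anchoring objective, but the core idea of using phonemes as anchors, pulling embeddings of the same phoneme toward a shared cross-lingual centroid, can be sketched as below. The function name, the mean-embedding anchors, and the squared-distance alignment score are all illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def phoneme_anchored_alignment(embs, phonemes):
    """Illustrative sketch of phoneme anchoring (not the paper's method).

    embs:     (N, D) array of frame/segment emotion embeddings
    phonemes: (N,) array of phoneme class ids

    Each phoneme's anchor is the mean embedding over all languages and
    speakers; the returned distance measures how far embeddings sit from
    their anchors (a quantity a training loss could minimize).
    """
    # Language-shared anchor per phoneme class: the mean embedding.
    anchors = {p: embs[phonemes == p].mean(axis=0)
               for p in np.unique(phonemes)}
    # Mean squared distance of each embedding to its phoneme anchor.
    dist = float(np.mean([np.sum((e - anchors[p]) ** 2)
                          for e, p in zip(embs, phonemes)]))
    return anchors, dist
```

In a real system, this distance would be one term of a training objective, computed in both the phoneme and speaker spaces rather than on raw embeddings.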

📝 Abstract
Cross-lingual speech emotion recognition (SER) remains a challenging task due to differences in phonetic variability and speaker-specific expressive styles across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align the externalization of emotions across different speakers and languages. To address this problem, we propose a speaker-style aware phoneme anchoring framework that aligns emotional expression at the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities in cross-lingual emotion representation.
Problem

Research questions and friction points this paper is trying to address.

Addressing phonetic variability and speaker style differences in cross-lingual emotion recognition
Aligning emotional expression across different speakers and languages effectively
Improving emotion transfer between languages using speaker and phonetic space anchoring
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speaker-style aware phoneme anchoring for emotion alignment
Graph-based clustering to build emotion-specific speaker communities
Dual-space anchoring in speaker and phonetic spaces
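The paper builds its speaker communities with a graph neural network; as a much simpler stand-in, the community idea can be sketched by thresholding pairwise cosine similarity between speaker-level emotion embeddings and taking connected components of the resulting graph. Everything here (function name, threshold, clustering rule) is an assumption for illustration only.

```python
import numpy as np

def speaker_communities(spk_embs, threshold=0.8):
    """Toy speaker-community construction (illustrative, not the paper's GNN).

    spk_embs: (S, D) array of per-speaker emotion-style embeddings.
    Returns an (S,) array of community labels: speakers whose cosine
    similarity meets the threshold end up in the same component.
    """
    # Normalize rows so the Gram matrix holds cosine similarities.
    X = spk_embs / np.linalg.norm(spk_embs, axis=1, keepdims=True)
    adj = (X @ X.T) >= threshold  # boolean adjacency matrix

    # Connected components via depth-first search over the graph.
    n = len(X)
    labels = -np.ones(n, dtype=int)
    comp = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        stack = [i]
        labels[i] = comp
        while stack:
            u = stack.pop()
            for v in np.where(adj[u])[0]:
                if labels[v] < 0:
                    labels[v] = comp
                    stack.append(v)
        comp += 1
    return labels
```

The resulting groups would then serve as the "emotion-specific speaker communities" within which dual-space anchoring is applied.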
Shreya G. Upadhyay
National Tsing Hua University
Machine Learning · Affective Computing · Behavioral Speech Signal Processing · Speech Emotion
Carlos Busso
Language Technologies Institute, Carnegie Mellon University, USA
Chi-Chun Lee
Department of Electrical Engineering, National Tsing Hua University, Taiwan