🤖 AI Summary
Cross-lingual speech emotion recognition (SER) faces dual challenges: acoustic variation across languages and speaker-specific stylistic differences. To address these, we propose a phoneme-speaker dual-space alignment framework with explicit style awareness. First, graph-based clustering groups speakers into emotion-specific communities that capture shared expressive traits. Second, phonemes serve as anchors to jointly align emotional representations in the phoneme and speaker embedding spaces. Finally, dual-space embedding mapping, coupled with cross-lingual transfer learning, enhances generalization across languages and speakers. Evaluated on MSP-Podcast and BIIC-Podcast, two challenging cross-lingual SER benchmarks, our method significantly outperforms state-of-the-art baselines. The results demonstrate its effectiveness in modeling emotional expression patterns shared across languages and speakers, validating its capacity to capture language-invariant, speaker-robust emotional cues. This work provides a structured, transferable solution for SER in low-resource languages, advancing robust cross-lingual affective computing.
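The summary's first step, grouping speakers into emotion-specific communities via a similarity graph, can be sketched as follows. This is an illustrative toy only, not the paper's implementation: the embeddings, the similarity threshold, and the use of connected components as a stand-in for community detection are all assumptions.

```python
# Hypothetical sketch: emotion-specific speaker communities from a
# similarity graph. Embeddings, threshold, and connected-component
# grouping are illustrative assumptions, not the paper's method.
import math
from collections import defaultdict, deque

# Toy per-speaker embeddings for one emotion (e.g., mean "happy" vectors).
speaker_emb = {
    "spkA": [0.9, 0.1, 0.0], "spkB": [0.8, 0.2, 0.1],
    "spkC": [0.1, 0.9, 0.2], "spkD": [0.0, 1.0, 0.1],
    "spkE": [0.5, 0.5, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Add an edge between speakers whose expressive styles are similar enough.
THRESHOLD = 0.8  # hypothetical cutoff
adj = defaultdict(set)
names = list(speaker_emb)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        if cosine(speaker_emb[u], speaker_emb[v]) > THRESHOLD:
            adj[u].add(v)
            adj[v].add(u)

def communities(nodes, adj):
    """Communities = connected components of the similarity graph."""
    seen, groups = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = set(), deque([n])
        while queue:
            cur = queue.popleft()
            if cur in comp:
                continue
            comp.add(cur)
            queue.extend(adj[cur] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

print(communities(names, adj))  # → [['spkA', 'spkB'], ['spkC', 'spkD'], ['spkE']]
```

A real system would replace the toy vectors with learned speaker-emotion embeddings and a proper community-detection algorithm (e.g., modularity-based), but the grouping logic is the same.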
📝 Abstract
Cross-lingual speech emotion recognition (SER) remains a challenging task due to phonetic variation and speaker-specific expressive styles that differ across languages. Effectively capturing emotion under such diverse conditions requires a framework that can align how emotions are externalized by different speakers and in different languages. To address this problem, we propose a speaker-style-aware phoneme anchoring framework that aligns emotional expression at both the phonetic and speaker levels. Our method builds emotion-specific speaker communities via graph-based clustering to capture shared speaker traits. Using these groups, we apply dual-space anchoring in the speaker and phonetic spaces to enable better emotion transfer across languages. Evaluations on the MSP-Podcast (English) and BIIC-Podcast (Taiwanese Mandarin) corpora demonstrate improved generalization over competitive baselines and provide valuable insights into the commonalities of cross-lingual emotion representation.
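The phoneme-anchoring idea in the abstract can be made concrete with a minimal sketch: emotional embeddings that share a phoneme, regardless of language, are pulled toward a common anchor. Everything here is a hedged assumption for illustration: the toy embeddings, the mean-pooled anchor, and the squared-distance loss are not taken from the paper.

```python
# Hypothetical sketch of phoneme anchoring: pull same-phoneme emotional
# embeddings from both languages toward a shared anchor. All names,
# shapes, and the squared-distance loss are illustrative assumptions.

# Toy emotional embeddings keyed by phoneme, per language.
english = {"ah": [[0.9, 0.1], [1.1, -0.1]], "iy": [[0.0, 1.0]]}
mandarin = {"ah": [[0.8, 0.2]], "iy": [[0.2, 0.9], [-0.1, 1.1]]}

def anchor(vectors):
    """Shared anchor = mean of all embeddings for one phoneme."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def anchoring_loss(lang_a, lang_b):
    """Mean squared distance of every embedding to its phoneme anchor."""
    total, count = 0.0, 0
    for ph in lang_a.keys() & lang_b.keys():
        pooled = lang_a[ph] + lang_b[ph]
        c = anchor(pooled)
        for v in pooled:
            total += sum((x - y) ** 2 for x, y in zip(v, c))
            count += 1
    return total / count

loss = anchoring_loss(english, mandarin)
print(round(loss, 4))  # → 0.0267
```

Minimizing a loss of this shape would encourage language-invariant emotional representations at the phoneme level; the paper additionally anchors in the speaker space via the emotion-specific communities.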