Semantic-Emotional Resonance Embedding: A Semi-Supervised Paradigm for Cross-Lingual Speech Emotion Recognition

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of cross-lingual speech emotion recognition in low-resource languages, where performance is hindered by the scarcity of labeled data and translation alignments. To overcome this limitation, the authors propose a semi-supervised approach that requires neither target-language labels nor alignment information. The method constructs an emotion–semantic structure and incorporates an Instantaneous Resonance Field (IRF) to model human emotional experience, guiding unlabeled samples toward self-organization within that structure. Additionally, a semantic–emotional resonance embedding mechanism and a Triple-Resonance Interaction Chain (TRIC) loss are introduced to enhance collaborative representation between labeled and unlabeled samples in emotion-critical regions. Remarkably, with only 5-shot labeled data from the source language, the approach achieves substantial performance gains across multiple low-resource languages, demonstrating both efficacy and strong generalization capability.
📝 Abstract
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods rely heavily on the semantic synchrony of complete labels and on static feature stability, preventing low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target-language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure from a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure and thereby achieving semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss that strengthens the interaction and joint embedding of labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, which requires only 5-shot labeling in the source language.
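The abstract does not give the exact form of the IRF or TRIC terms, but the overall training objective it describes, supervised learning on a few labeled source-language samples plus a term that pulls unlabeled samples into the learned emotion-semantic structure, can be sketched generically. The following is a minimal illustration, not the paper's actual loss: `semi_supervised_emotion_loss`, the prototype-attraction term, and all shapes are hypothetical stand-ins.

```python
import numpy as np

def semi_supervised_emotion_loss(labeled_logits, labels,
                                 prototypes, unlabeled_emb, lam=0.5):
    """Hypothetical sketch of a semi-supervised objective in the spirit of
    the abstract: cross-entropy on few-shot labeled samples, plus a
    prototype-attraction term (a stand-in for the IRF/TRIC components)
    that encourages unlabeled embeddings to self-organize around the
    emotion-semantic structure."""
    # Supervised cross-entropy on the 5-shot labeled source-language batch
    shifted = labeled_logits - labeled_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    ce = -np.log(probs[np.arange(len(labels)), labels]).mean()

    # Unsupervised term: mean distance from each unlabeled embedding to its
    # nearest emotion prototype (one prototype per emotion class)
    dists = np.linalg.norm(
        unlabeled_emb[:, None, :] - prototypes[None, :, :], axis=2)
    attract = dists.min(axis=1).mean()

    return ce + lam * attract
```

In this toy setup `labeled_logits` has shape (N, C), `prototypes` (C, D), and `unlabeled_emb` (M, D); the actual paper replaces the attraction term with its resonance-based formulation.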
Problem

Research questions and friction points this paper is trying to address.

Cross-lingual Speech Emotion Recognition
low-resource languages
semantic synchrony
static feature stability
emotional state identification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Emotional Resonance Embedding
Cross-Lingual Speech Emotion Recognition
Semi-Supervised Learning
Instantaneous Resonance Field
Triple-Resonance Interaction Chain
Ya Zhao
School of Computer Science and Technology, Xinjiang University, Urumqi, China; Pengcheng Laboratory Xinjiang Network Node; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center; Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University
Yinfeng Yu
Associate Professor, Xinjiang University
Embodied intelligence
Liejun Wang
School of Computer Science and Technology, Xinjiang University, Urumqi, China; Pengcheng Laboratory Xinjiang Network Node; Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center; Joint Research Laboratory for Embodied Intelligence, Xinjiang University; Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University