Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of labeled data for speech emotion recognition (SER) in low-resource languages, this paper proposes the first cross-lingual data augmentation paradigm that integrates expressive speech-to-speech translation (S2ST) with self-iterative guided data filtering. The method uses labeled speech from high-resource languages to generate target-language speech via S2ST, preserving both emotional semantics and prosodic characteristics. It introduces a dual-criterion sampling mechanism based on emotional consistency and model confidence, coupled with multi-stage pseudo-label refinement and a cross-lingual fine-tuning framework, and requires no human annotations in the target language. Evaluated on five low-resource languages, the approach improves SER accuracy by 12.3% on average, significantly outperforming ASR+TTS and text-based back-translation baselines. It also generalizes well across diverse upstream speech models.
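The dual-criterion selection described in this summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `(features, source_label)` sample layout, the 0.7 confidence threshold, and the toy scoring function are all assumptions made for illustration. A translated sample is kept only when the SER model's prediction matches the label inherited from the source utterance (emotional consistency) and the predicted probability is high enough (model confidence).

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample layout: (features, source_label), where source_label is
# the emotion label carried over from the high-resource source utterance.
Sample = Tuple[Sequence[float], int]


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_samples(
    candidates: Sequence[Sample],
    score_fn: Callable[[Sequence[float]], Sequence[float]],
    min_confidence: float = 0.7,  # assumed threshold, not taken from the paper
) -> List[Sample]:
    """Keep candidates that pass both the consistency and confidence checks."""
    kept = []
    for features, source_label in candidates:
        probs = softmax(score_fn(features))
        predicted = max(range(len(probs)), key=probs.__getitem__)
        consistent = predicted == source_label          # criterion 1
        confident = probs[predicted] >= min_confidence  # criterion 2
        if consistent and confident:
            kept.append((features, source_label))
    return kept


def identity_scores(features: Sequence[float]) -> Sequence[float]:
    """Toy stand-in for an SER model: the features are already class logits."""
    return features


candidates = [
    ([4.0, 0.0, 0.0], 0),  # consistent and confident: kept
    ([4.0, 0.0, 0.0], 1),  # prediction disagrees with source label: dropped
    ([0.2, 0.1, 0.0], 0),  # consistent but low confidence: dropped
]
selected = select_samples(candidates, identity_scores)
```

In a bootstrapping setup, a filter like this would presumably be re-run after each round of fine-tuning on the previously selected samples, so that the selection improves as the target-language model improves.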

📝 Abstract
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in languages with limited SER resources by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.
Problem

Research questions and friction points this paper is trying to address.

Multilingual Emotion Recognition
Under-resourced Languages
Speech Recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speech Translation
Data Filtering
Multilingual Emotion Recognition