CLEP-DG: Contrastive Learning for Speech Emotion Domain Generalization via Soft Prompt Tuning

📅 2025-07-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speech emotion recognition (SER) models suffer from poor generalization under cross-domain acoustic conditions. Although Contrastive Language–Audio Pretraining (CLAP) exhibits strong multimodal alignment, it lacks emotion-specific modeling mechanisms. To address this, the paper proposes CLEP-DG, a framework with three key components: (1) CLEP, a CLAP model fine-tuned on large-scale emotional speech to better encode emotion-relevant features; (2) Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that models diverse acoustic environments without additional labeled audio; and (3) cross-modal classifier transfer, which trains a classifier on text-derived embeddings and applies it to the audio encoder at inference, mitigating the domain shift between textual supervision and speech. Only the learnable prompt vectors are fine-tuned; no backbone modification or extra audio annotations are needed. Evaluated on five benchmark datasets, CLEP-DG consistently outperforms the CLAP baseline in both supervised and domain-generalization settings, achieving state-of-the-art performance.

📝 Abstract
Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a framework that enhances CLAP's robustness in emotion recognition. First, we fine-tune CLAP to obtain CLEP, adapting it on large-scale emotional speech datasets to better encode emotion-relevant features. Then, we introduce Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio. Finally, leveraging cross-modal transferability, we train a classifier on text-derived embeddings and apply it to the audio encoder during inference, mitigating domain shifts between textual supervision and audio-based emotion recognition. Experiments across five benchmark datasets show that CLEP-DG outperforms prior CLAP-based approaches, achieving state-of-the-art performance in both supervised and domain generalization settings.
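The prompt-tuning step described in the abstract optimizes only a small set of learnable context vectors while the pretrained backbone stays frozen. A minimal sketch of that idea, with a fixed linear map standing in for CLEP's text tower (the real encoder is a transformer; all names and dimensions here are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_tok, n_ctx, d_emb = 8, 4, 8

# Frozen stand-in for CLEP's text tower (hypothetical; the real
# encoder is a transformer, used here only to show what "tune the
# prompts, freeze the backbone" means).
W_frozen = rng.normal(size=(d_tok, d_emb)) / np.sqrt(d_tok)

def encode(ctx, class_tok):
    """Mean-pool learnable context vectors with a fixed class token,
    then project through the frozen encoder."""
    tokens = np.vstack([ctx, class_tok])
    return tokens.mean(axis=0) @ W_frozen

class_tok = rng.normal(size=(1, d_tok))       # e.g. token for "happy"
target = rng.normal(size=d_emb)               # a paired audio embedding
ctx = rng.normal(size=(n_ctx, d_tok)) * 0.02  # learnable soft prompts

def loss(ctx):
    diff = encode(ctx, class_tok) - target
    return float(diff @ diff)

initial = loss(ctx)
lr = 0.5
for _ in range(300):
    grad_z = 2.0 * (encode(ctx, class_tok) - target)
    # The gradient flows only into ctx; W_frozen and class_tok stay fixed.
    grad_ctx = np.tile(grad_z @ W_frozen.T / (n_ctx + 1), (n_ctx, 1))
    ctx -= lr * grad_ctx
```

After the loop, the prompt vectors alone have moved the text embedding toward the target, which is the mechanism ACPT exploits: adapting to new acoustic contexts without touching backbone weights.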
Problem

Research questions and friction points this paper is trying to address.

Improving speech emotion recognition across diverse acoustic conditions
Enhancing CLAP's emotion cue capture for better SER performance
Mitigating the domain shift between textual supervision and audio-based emotion recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-tunes CLAP for emotion feature encoding
Uses text-driven Acoustic Context Prompt Tuning
Trains a classifier on text-derived embeddings and transfers it to the audio encoder at inference
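The last bullet relies on CLAP aligning text and audio in one embedding space: a classifier fit on text embeddings can then score audio embeddings directly. A minimal sketch, with random normalized vectors standing in for real CLEP embeddings (all values here are simulated assumptions):

```python
import numpy as np

def l2_normalize(x):
    # Project embeddings onto the unit sphere, as in CLAP-style models.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
dim, n_classes = 16, 4

# Hypothetical text embeddings of emotion prompts, one per class,
# standing in for CLEP's text-encoder output.
text_emb = l2_normalize(rng.normal(size=(n_classes, dim)))

# "Train" on text only: with unit-norm embeddings, a nearest-prototype
# classifier is a linear layer whose weights are the class text embeddings.
W = text_emb

# Inference on audio: because the modalities share one space, audio
# embeddings (simulated here as noisy copies of the text embeddings)
# are scored with the text-trained weights, no audio labels needed.
audio_emb = l2_normalize(text_emb + 0.05 * rng.normal(size=(n_classes, dim)))
pred = (audio_emb @ W.T).argmax(axis=1)
```

In the simulated setup each audio embedding is assigned to its own class prototype, which is the transfer effect the paper exploits to sidestep labeled audio at classifier-training time.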
Jiacheng Shi
Department of Computer Science, William & Mary, USA
Yanfu Zhang
William & Mary
Ye Gao
Department of Computer Science, William & Mary, USA