🤖 AI Summary
Speech emotion recognition (SER) models generalize poorly across mismatched acoustic conditions. Although Contrastive Language–Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks emotion-specific modeling mechanisms. To address this, the paper proposes CLEP-DG, a framework with three key components: (1) CLEP, a CLAP backbone fine-tuned on large-scale emotional speech to better encode emotion-relevant features; (2) Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio; and (3) cross-modal classifier transfer, in which a classifier trained on text-derived embeddings is applied to audio embeddings at inference, mitigating the shift between textual supervision and speech-based recognition. Only the prompt vectors are tuned; the backbone is left unmodified and no extra audio annotations are needed. Evaluated on five benchmark datasets, CLEP-DG consistently outperforms CLAP-based baselines and achieves state-of-the-art performance in both supervised and domain-generalization settings.
📝 Abstract
Speech Emotion Recognition (SER) is fundamental to affective computing and human-computer interaction, yet existing models struggle to generalize across diverse acoustic conditions. While Contrastive Language-Audio Pretraining (CLAP) provides strong multimodal alignment, it lacks dedicated mechanisms for capturing emotional cues, making it suboptimal for SER. To address this, we propose CLEP-DG, a framework that enhances CLAP's robustness in emotion recognition. First, we fine-tune CLAP to obtain CLEP, adapting it on large-scale emotional speech datasets to better encode emotion-relevant features. Then, we introduce Acoustic Context Prompt Tuning (ACPT), a text-driven augmentation strategy that optimizes learnable prompt vectors to model diverse acoustic environments without additional labeled audio. Finally, leveraging cross-modal transferability, we train a classifier on text-derived embeddings and apply it to the audio encoder during inference, mitigating domain shifts between textual supervision and audio-based emotion recognition. Experiments across five benchmark datasets show that CLEP-DG outperforms prior CLAP-based approaches, achieving state-of-the-art performance in both supervised and domain generalization settings.
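To make the prompt-tuning and cross-modal transfer ideas concrete, here is a minimal PyTorch sketch. It is not the paper's implementation: the toy encoders, dimensions, and emotion labels are all placeholder assumptions standing in for CLAP's frozen text/audio encoders and tokenizer. It only illustrates the two mechanisms the abstract describes: learnable context vectors prepended to the class-name tokens, and text-derived class embeddings used as cosine-similarity classifier weights on an audio embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for CLAP's components; real CLAP encoders are not shown.
EMB_DIM, TOK_DIM, N_CTX = 64, 32, 4
EMOTIONS = ["angry", "happy", "sad", "neutral"]

torch.manual_seed(0)

class ToyTextEncoder(nn.Module):
    """Maps a sequence of token embeddings to one joint-space embedding."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOK_DIM, EMB_DIM)
    def forward(self, tokens):                 # (n_classes, seq_len, TOK_DIM)
        return self.proj(tokens.mean(dim=1))   # (n_classes, EMB_DIM)

text_encoder = ToyTextEncoder()
for p in text_encoder.parameters():            # backbone stays frozen
    p.requires_grad_(False)

# Learnable acoustic-context prompt vectors, shared across classes (ACPT-style);
# these are the only parameters that would be optimized during tuning.
ctx = nn.Parameter(torch.randn(N_CTX, TOK_DIM) * 0.02)

# Fixed embeddings for the emotion class-name tokens (placeholder for a tokenizer).
class_tok = torch.randn(len(EMOTIONS), 1, TOK_DIM)

def class_embeddings():
    # Prepend the shared learnable context to each class token sequence.
    prompts = torch.cat([ctx.expand(len(EMOTIONS), -1, -1), class_tok], dim=1)
    return F.normalize(text_encoder(prompts), dim=-1)

# Cross-modal transfer: text-derived embeddings act as classifier weights
# applied to an embedding from the (frozen) audio encoder at inference.
audio_emb = F.normalize(torch.randn(1, EMB_DIM), dim=-1)  # stand-in audio embedding
logits = audio_emb @ class_embeddings().t()               # cosine-similarity logits
pred = EMOTIONS[logits.argmax(dim=-1).item()]
```

In a real training loop, a contrastive or cross-entropy loss on `logits` would update only `ctx`, which is what keeps the method annotation-light: the text side generates the classifier, and the frozen joint embedding space carries it over to audio.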