Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought

📅 2025-02-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address hallucination and output instability in large-scale Audio-Language Models (ALMs) for Speech Emotion Recognition (SER), this paper proposes C²SER, a dual-channel context-aware model integrating semantic and acoustic perception. Methodologically: (1) it introduces the first explicit-to-implicit Chain-of-Thought (CoT) self-distillation framework to enhance reasoning robustness; (2) it designs a semi-supervised enhanced Emotion2Vec-S acoustic encoder; and (3) it synergizes Whisper-based semantic encoding with multi-granularity emotion perception to achieve cross-modal feature alignment and chained inference. Evaluated on multiple benchmarks, C²SER achieves a 5.2% absolute improvement in emotion classification accuracy over Qwen2-Audio and SECAP, while reducing error rate variance by 38%. These results demonstrate significant gains in both model stability and interpretability, establishing a new state of the art in robust, explainable SER.

Technology Category

Application Category

📝 Abstract
Large-scale audio language models (ALMs), such as Qwen2-Audio, are capable of comprehending diverse audio signal, performing audio analysis and generating textual responses. However, in speech emotion recognition (SER), ALMs often suffer from hallucinations, resulting in misclassifications or irrelevant outputs. To address these challenges, we propose C$^2$SER, a novel ALM designed to enhance the stability and accuracy of SER through Contextual perception and Chain of Thought (CoT). C$^2$SER integrates the Whisper encoder for semantic perception and Emotion2Vec-S for acoustic perception, where Emotion2Vec-S extends Emotion2Vec with semi-supervised learning to enhance emotional discrimination. Additionally, C$^2$SER employs a CoT approach, processing SER in a step-by-step manner while leveraging speech content and speaking styles to improve recognition. To further enhance stability, C$^2$SER introduces self-distillation from explicit CoT to implicit CoT, mitigating error accumulation and boosting recognition accuracy. Extensive experiments show that C$^2$SER outperforms existing popular ALMs, such as Qwen2-Audio and SECap, delivering more stable and precise emotion recognition. We release the training code, checkpoints, and test sets to facilitate further research.
Problem

Research questions and friction points this paper is trying to address.

Enhance speech emotion recognition stability
Reduce hallucinations in audio language models
Improve accuracy via contextual and acoustic perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual perception enhances SER accuracy
Chain of Thought improves recognition stability
Self-distillation reduces error accumulation
🔎 Similar Papers
No similar papers found.