When Tone and Words Disagree: Towards Robust Speech Emotion Recognition under Acoustic-Semantic Conflict

πŸ“… 2026-01-08
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the significant performance degradation of existing speech emotion recognition models when acoustic expressions and semantic content are in conflictβ€”a limitation primarily caused by entangled acoustic-semantic representations and semantic bias. To mitigate this, we propose the FAS framework, which explicitly decouples acoustic and semantic pathways and integrates them via a lightweight query-based attention mechanism. We further introduce CASE, the first benchmark dataset specifically designed around acoustic-semantic conflicts, to rigorously evaluate model robustness in such scenarios. Experimental results demonstrate that FAS achieves 59.38% accuracy on CASE, substantially outperforming current methods under both in-domain and zero-shot settings, thereby effectively alleviating performance collapse in conflicting emotional contexts.

πŸ“ Abstract
Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. However, in real-world interactions, acoustic-semantic conflict, where the emotion conveyed by tone contradicts the literal meaning of the spoken words, is common yet overlooked. We show that state-of-the-art SER models, including ASR-based models, self-supervised learning (SSL) approaches, and Audio Language Models (ALMs), suffer performance degradation under such conflicts due to semantic bias or entangled acoustic-semantic representations. To address this, we propose the Fusion Acoustic-Semantic (FAS) framework, which explicitly disentangles acoustic and semantic pathways and bridges them through a lightweight, query-based attention module. To enable systematic evaluation, we introduce Conflict in Acoustic-Semantic Emotion (CASE), the first dataset dominated by clear and interpretable acoustic-semantic conflicts across varied scenarios. Extensive experiments demonstrate that FAS consistently outperforms existing methods in both in-domain and zero-shot settings. Notably, on the CASE benchmark, conventional SER models fail dramatically, while FAS sets a new SOTA with 59.38% accuracy. Our code and datasets are available at https://github.com/24DavidHuang/FAS.
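To make the fusion idea concrete, below is a minimal PyTorch sketch of a query-based attention head that bridges two separately encoded streams. The class name `QueryAttentionFusion`, the feature dimension, the number of queries, the 7-class output, and the mean-pooled classifier are all illustrative assumptions, not the authors' released FAS implementation.

```python
# Minimal sketch of query-based attention fusion (hypothetical names/dims).
import torch
import torch.nn as nn

class QueryAttentionFusion(nn.Module):
    """Fuses disentangled acoustic and semantic streams with a small
    set of learnable query tokens, then classifies emotion."""

    def __init__(self, dim=256, num_queries=4, num_heads=4, num_classes=7):
        super().__init__()
        # Learnable queries attend over both modality streams.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, acoustic, semantic):
        # acoustic: (B, Ta, dim) frame features; semantic: (B, Ts, dim) token features.
        # The two pathways stay separate until this fusion step.
        kv = torch.cat([acoustic, semantic], dim=1)          # (B, Ta+Ts, dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.attn(q, kv, kv)                      # (B, num_queries, dim)
        return self.classifier(fused.mean(dim=1))            # (B, num_classes) logits

# Usage with dummy inputs:
fusion = QueryAttentionFusion()
a = torch.randn(2, 120, 256)   # e.g., SSL acoustic frames
s = torch.randn(2, 24, 256)    # e.g., projected transcript tokens
logits = fusion(a, s)          # (2, 7)
```

Keeping the encoders separate until this step means a misleading transcript cannot overwrite prosodic evidence during feature extraction; the learnable queries decide per utterance how much to weight each stream.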
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
Acoustic-Semantic Conflict
Semantic Bias
Emotion Recognition Robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Acoustic-Semantic Conflict
Disentangled Representation
Speech Emotion Recognition
Query-Based Attention
Robustness
πŸ”Ž Similar Papers
No similar papers found.
Dawei Huang
Inclusion AI, Ant Group
Yongjie Lv
Inclusion AI, Ant Group
Ruijie Xiong
Inclusion AI, Ant Group
Chunxiang Jin
Inclusion AI, Ant Group
Xiaojiang Peng
Shenzhen Technology University
Computer Vision · Facial Expression Recognition · Multimodal Emotion Recognition