Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

📅 2025-10-28
🤖 AI Summary
This work investigates the cross-modal fusion capability of spoken language models (SLMs) under conditions where the emotion conveyed by an utterance's semantic content conflicts with the emotion conveyed by its prosody. To probe whether existing SLMs over-rely on textual semantics while neglecting acoustic emotion cues, the authors propose an emotionally incongruent speech evaluation paradigm and introduce EMIS (Emotionally Incongruent Synthetic Speech), a controllable synthetic dataset of speech samples with conflicting semantic and prosodic emotion. Using four state-of-the-art SLMs, they conduct cross-modal attention analysis and ablation studies. Results demonstrate that current SLMs base emotion predictions predominantly on text, with minimal contribution from acoustic features, revealing a severe modality imbalance in their cross-modal fusion mechanisms. The study is the first to systematically expose such modality bias in SLMs' emotion understanding. The authors publicly release the EMIS dataset and associated code to establish a benchmark and guide the development of robust multimodal emotion models.

📝 Abstract
Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, indicating that text-related representations largely dominate over acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
Problem

Research questions and friction points this paper is trying to address.

Evaluating emotion recognition in spoken language models
Assessing model reliance on text versus acoustic cues
Testing generalization on emotionally incongruent speech samples
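The core measurement behind these questions can be sketched with a small, hypothetical scoring function (an illustration of the evaluation idea, not the authors' released code): for samples whose transcript and prosody carry different emotion labels, check which modality the model's prediction agrees with.

```python
from collections import Counter

def modality_agreement(predictions, text_labels, audio_labels):
    """On emotionally incongruent samples (text label != audio label),
    count how often the model's prediction matches the textual emotion,
    the acoustic emotion, or neither. Returns fractions of each."""
    counts = Counter()
    for pred, text_emo, audio_emo in zip(predictions, text_labels, audio_labels):
        if text_emo == audio_emo:
            continue  # congruent samples cannot disambiguate the modalities
        if pred == text_emo:
            counts["text"] += 1
        elif pred == audio_emo:
            counts["audio"] += 1
        else:
            counts["neither"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in ("text", "audio", "neither")}

# Toy example: the fourth sample is congruent and is skipped.
scores = modality_agreement(
    predictions=["happy", "sad", "angry", "happy"],
    text_labels=["happy", "sad", "sad", "happy"],
    audio_labels=["sad", "happy", "angry", "happy"],
)
```

A text-dominated model, as the paper reports for current SLMs, would show a `"text"` fraction near 1 on an incongruent set such as EMIS.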
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating SLMs on emotionally incongruent speech samples
Assessing reliance on textual versus acoustic representations
Releasing EMIS dataset and code for community use
Pedro Corrêa
School of Electrical and Computer Engineering, Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil
João Lima
School of Electrical and Computer Engineering, Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil
Victor Moreno
School of Electrical and Computer Engineering, Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil
Paula Dornhofer Paro Costa
Professor, University of Campinas (UNICAMP)
Digital Image Synthesis and Analysis · Machine Learning · Visualization