EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

📅 2026-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current speech-based large language models in emotion recognition, which typically treat the task as a simple classification problem, lacking interpretability and underutilizing reasoning capabilities. To overcome this, the study reframes emotion recognition as a deep reasoning problem and introduces a prosody-enhanced multimodal foundation model. The authors also construct EmotionCoT-35K, the first chain-of-thought annotated dataset for speech emotion recognition. By incorporating Group-Relative Policy Optimization (GRPO) and a Progressive Trust-aware Reasoning Reward (PTR) mechanism, the model enables interpretable emotion predictions grounded in fine-grained acoustic cues. Experimental results demonstrate that the proposed approach surpasses state-of-the-art methods in both emotion recognition accuracy and explanation quality, advancing the field toward an interpretable, multimodal reasoning paradigm.

📝 Abstract
Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces a reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models in both emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: https://github.com/dingdongwang/EmotionThinker
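The abstract describes GRPO-PTR's reward as a rule-based outcome reward combined with a progressively introduced reasoning reward, scaled by a trustworthiness weight that reflects reasoning/outcome alignment. A minimal sketch of how such a shaped reward could be composed is below; all function names, the linear warmup schedule, and the trust formula are illustrative assumptions, not the paper's actual implementation.

```python
def progressive_weight(step: int, warmup_steps: int = 1000) -> float:
    """Gradually introduce the reasoning reward (linear warmup assumed)."""
    return min(1.0, step / warmup_steps)

def trustworthiness(reasoning_score: float, outcome_correct: bool) -> float:
    """Hypothetical alignment weight: trust the reasoning reward when a
    well-scored chain of thought also yields the correct emotion label,
    and down-weight it when reasoning and outcome disagree."""
    return reasoning_score if outcome_correct else 1.0 - reasoning_score

def grpo_ptr_reward(outcome_correct: bool,
                    reasoning_score: float,
                    step: int) -> float:
    """Combine the rule-based outcome reward with a progressively
    introduced, trust-weighted reasoning reward (reasoning_score would
    come from the multi-dimensional reward model in the paper)."""
    outcome_reward = 1.0 if outcome_correct else 0.0
    w = progressive_weight(step)
    trust = trustworthiness(reasoning_score, outcome_correct)
    return outcome_reward + w * trust * reasoning_score
```

Early in training the shaped term vanishes (pure outcome reward, as in standard GRPO); after warmup, high-quality reasoning that agrees with the predicted label earns the largest bonus.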
Problem

Research questions and friction points this paper is trying to address.

Speech Emotion Recognition
Interpretability
Prosody
Multimodal Reasoning
Emotion Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prosody-Aware Reinforcement Learning
Explainable Emotion Reasoning
Chain-of-Thought for Speech
EmotionThinker
GRPO-PTR
Dingdong Wang
The Chinese University of Hong Kong
Shujie Liu
Microsoft Corporation
Tianhua Zhang
The Chinese University of Hong Kong
Natural Language Processing, Large Language Models
Youjun Chen
The Chinese University of Hong Kong
Jinyu Li
Partner Applied Science Manager, Microsoft
Acoustic Modeling, Speech Recognition, Speech Translation
Helen Meng
The Chinese University of Hong Kong