EMO-RL: Emotion-Rule-Based Reinforcement Learning Enhanced Audio-Language Model for Generalized Speech Emotion Recognition

📅 2025-09-19
🤖 AI Summary
Large audio-language models suffer from unstable convergence in speech emotion recognition due to ambiguous emotion boundaries, while smaller models (e.g., 7B) lack sufficient reasoning capacity. To address these issues, we propose a Group Relative Policy Optimization (GRPO) framework that integrates emotion similarity-weighted rewards, explicit structured reasoning, and emotion rule constraints. Built upon pre-trained audio-language models, the framework models fine-grained emotion similarity, guides stepwise reasoning, and employs intra-group relative advantage updates to enhance discrimination and generalization of subtle, cross-contextual emotions. Our method achieves state-of-the-art performance on MELD and IEMOCAP, with cross-dataset experiments demonstrating superior robustness and training stability.
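The intra-group relative advantage update mentioned in the summary follows the general GRPO recipe: sample a group of candidate responses per utterance, score each with the reward function, then normalize each reward against the group's mean and standard deviation. A minimal sketch of that normalization step (the reward values below are illustrative, not taken from the paper):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its sampled group,
    so responses compete against their siblings rather than a learned critic."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Illustrative rewards for 4 sampled responses to one utterance
advs = group_relative_advantages([1.0, 0.6, 0.2, 0.6])
```

Responses rewarded above the group mean receive a positive advantage and are reinforced; those below the mean are suppressed, with the group's own spread setting the scale.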

📝 Abstract
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent advances in Reinforcement Learning (RL) have shown promise in improving LALMs' reasoning abilities. However, two critical challenges hinder the direct application of RL techniques to Speech Emotion Recognition (SER) tasks: (1) convergence instability caused by ambiguous emotional boundaries and (2) limited reasoning ability when using relatively small models (e.g., 7B-parameter architectures). To overcome these limitations, we introduce EMO-RL, a novel framework incorporating reinforcement learning with two key innovations: Emotion Similarity-Weighted Reward (ESWR) and Explicit Structured Reasoning (ESR). Built upon pretrained LALMs, our method employs group-relative policy optimization with emotion constraints. Comprehensive experiments demonstrate that our EMO-RL training strategies significantly enhance the emotional reasoning capabilities of LALMs, attaining state-of-the-art results on both the MELD and IEMOCAP datasets, with cross-dataset experiments demonstrating strong generalization.
Problem

Research questions and friction points this paper is trying to address.

Improving emotion recognition in audio-language models
Overcoming reinforcement learning convergence instability
Enhancing reasoning in small models for emotion tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Similarity-Weighted Reward mechanism
Explicit Structured Reasoning framework
Group-relative policy optimization with constraints
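The Emotion Similarity-Weighted Reward idea can be sketched as follows. The similarity table below is a hypothetical stand-in (the paper's actual emotion similarity modeling is not reproduced here): instead of a binary correct/incorrect signal, a prediction earns partial credit proportional to its similarity to the gold label, which softens the ambiguous-boundary problem the paper identifies.

```python
# Hypothetical pairwise similarity between emotion labels
# (illustrative values, NOT the paper's learned similarities).
EMO_SIM = {
    ("angry", "angry"): 1.0,
    ("angry", "frustrated"): 0.7,
    ("angry", "sad"): 0.3,
    ("angry", "happy"): 0.0,
}

def similarity_weighted_reward(pred: str, gold: str) -> float:
    """Partial credit: exact match -> 1.0, a near-miss emotion -> a fraction,
    a dissimilar emotion -> ~0. Lookup is symmetric in (gold, pred)."""
    key = (gold, pred) if (gold, pred) in EMO_SIM else (pred, gold)
    return EMO_SIM.get(key, 0.0)

# Predicting "frustrated" when the gold label is "angry" earns partial credit
r = similarity_weighted_reward("frustrated", "angry")
```

A smoothed reward like this keeps the policy gradient informative near confusable emotion pairs, where a hard 0/1 reward would penalize a near-miss as heavily as a complete error.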
Pengcheng Li
Ph.D. of Computer Science, University of Rochester; Google (present)
Programming Systems, Compilers, Runtimes
Botao Zhao
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Zuheng Kang
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Junqing Peng
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Xiaoyang Qu
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Yayun He
Ping An Technology (Shenzhen) Co., Ltd., Shenzhen, China
Jianzong Wang
Postdoctoral Researcher of Department of Electrical and Computer Engineering, University of Florida
Big Data, Storage Systems, Cloud Computing