MECap-R1: Emotion-aware Policy with Reinforcement Learning for Multimodal Emotion Captioning

📅 2025-09-23

📈 Citations: 0

✨ Influential: 0

career value

219K/year

🤖 AI Summary

Traditional discrete emotion classification fails to capture the fine-grained, continuous emotional semantics inherent in speech. To address this, we propose a reinforcement learning–based multimodal approach for natural language description of speech emotions, jointly modeling acoustic and textual features to construct an emotion-aware policy network. Our key contribution is the Emo-GRPO reward mechanism—the first application of Grouped Relative Policy Optimization to emotion-oriented text generation—which overcomes the limitations of fixed heuristic rules in dynamic generation. Evaluated on the EmotionTalk dataset, our method achieves significant improvements over state-of-the-art approaches: +3.2 BLEU-4 (descriptive accuracy), +4.7 Emo-F1 (emotion consistency), and +8.1% Distinct-2 (lexical diversity). This work establishes an interpretable, fine-grained generative paradigm for speech emotion understanding.

Technology Category

Application Category

📝 Abstract

Speech Emotion Captioning (SEC) has emerged as a notable research direction. The inherent complexity of emotional content in human speech makes it challenging for traditional discrete classification methods to provide an adequate representation. Consequently, utilizing natural language to describe speech emotions presents a novel avenue for more effectively capturing and expressing affect. In this paper, we propose MECap-R1, a pioneering emotion-aware policy with reinforcement learning for multimodal emotion captioning. By employing Group Relative Policy Optimization with emotion-aware reward (Emo-GRPO), the framework precisely captures the emotion and semantic features, thereby addressing the shortcomings of rigid rules in handling the dynamic and flexible nature of captions. Experimental results on the EmotionTalk dataset demonstrate that MECap-R1 performs well in generating emotion descriptions and achieves substantial gains in both accuracy and diversity.

Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of discrete classification for speech emotions

Generating natural language descriptions for emotional speech content

Addressing rigid rule shortcomings in dynamic emotion captioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning with emotion-aware reward policy

Group Relative Policy Optimization for multimodal captioning

Captures emotion and semantic features dynamically

🔎 Similar Papers

Contextual Emotion Recognition using Large Vision Language Models