🤖 AI Summary
Traditional discrete emotion classification fails to capture the fine-grained, continuous emotional semantics inherent in speech. To address this, we propose a reinforcement learning–based multimodal approach for natural language description of speech emotions, jointly modeling acoustic and textual features to construct an emotion-aware policy network. Our key contribution is the Emo-GRPO reward mechanism—the first application of Grouped Relative Policy Optimization to emotion-oriented text generation—which overcomes the limitations of fixed heuristic rules in dynamic generation. Evaluated on the EmotionTalk dataset, our method achieves significant improvements over state-of-the-art approaches: +3.2 BLEU-4 (descriptive accuracy), +4.7 Emo-F1 (emotion consistency), and +8.1% Distinct-2 (lexical diversity). This work establishes an interpretable, fine-grained generative paradigm for speech emotion understanding.
📝 Abstract
Speech Emotion Captioning (SEC) has emerged as a notable research direction. The inherent complexity of emotional content in human speech makes it challenging for traditional discrete classification methods to provide an adequate representation. Consequently, utilizing natural language to describe speech emotions presents a novel avenue for more effectively capturing and expressing affect. In this paper, we propose MECap-R1, a pioneering emotion-aware policy with reinforcement learning for multimodal emotion captioning. By employing Group Relative Policy Optimization with emotion-aware reward (Emo-GRPO), the framework precisely captures the emotion and semantic features, thereby addressing the shortcomings of rigid rules in handling the dynamic and flexible nature of captions. Experimental results on the EmotionTalk dataset demonstrate that MECap-R1 performs well in generating emotion descriptions and achieves substantial gains in both accuracy and diversity.