VowelPrompt: Hearing Speech Emotions from Text via Vowel-level Prosodic Augmentation

📅 2026-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses speech emotion recognition by integrating semantic content with fine-grained prosodic cues, which current large language models (LLMs) often overlook, limiting both performance and interpretability. Building on phonetic evidence that vowels are primary carriers of emotional prosody, the study introduces a two-stage optimization framework: first, prosodic features such as pitch, energy, and duration are extracted at the vowel level and converted into natural language descriptions; then, supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR, implemented via GRPO) guides the LLM to jointly reason over semantic and prosodic information. The proposed method consistently outperforms state-of-the-art approaches across multiple benchmarks under zero-shot, fine-tuned, cross-domain, and cross-lingual settings, while generating interpretable emotion analyses grounded in both semantics and prosody.
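The first stage described above (vowel-level prosodic features rendered as text) can be illustrated with a minimal sketch. Everything here is hypothetical: the function name, the ±10% thresholds, and the toy time-aligned vowel segments are illustrative choices, not the paper's actual feature extractor or wording.

```python
import numpy as np

def describe_vowel_prosody(vowels, pitch, energy, hop_s=0.01):
    """Render vowel-level prosodic cues as a natural-language description.

    vowels: list of (phone, start_s, end_s), e.g. from a forced aligner
    pitch, energy: frame-level arrays sampled every hop_s seconds
    """
    utt_pitch = float(np.mean(pitch))    # utterance-level baselines
    utt_energy = float(np.mean(energy))
    parts = []
    for phone, start, end in vowels:
        i, j = int(round(start / hop_s)), int(round(end / hop_s))
        p = float(np.mean(pitch[i:j]))
        e = float(np.mean(energy[i:j]))
        dur_ms = (end - start) * 1000
        # coarse, interpretable labels relative to the utterance mean
        p_lvl = "high" if p > 1.1 * utt_pitch else "low" if p < 0.9 * utt_pitch else "neutral"
        e_lvl = "strong" if e > 1.1 * utt_energy else "weak" if e < 0.9 * utt_energy else "moderate"
        parts.append(f"vowel /{phone}/ ({dur_ms:.0f} ms): {p_lvl} pitch, {e_lvl} energy")
    return "; ".join(parts)

# toy utterance: pitch and energy rise on the second vowel
pitch = np.concatenate([np.full(50, 120.0), np.full(50, 180.0)])
energy = np.concatenate([np.full(50, 0.4), np.full(50, 0.8)])
vowels = [("AH", 0.00, 0.50), ("IY", 0.50, 1.00)]
print(describe_vowel_prosody(vowels, pitch, energy))
# → vowel /AH/ (500 ms): low pitch, weak energy; vowel /IY/ (500 ms): high pitch, strong energy
```

A string like this can simply be appended to the transcript in the LLM prompt, letting the model reason jointly over words and prosody.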

📝 Abstract
Emotion recognition in speech presents a complex multimodal challenge, requiring comprehension of both linguistic content and vocal expressivity, particularly prosodic features such as fundamental frequency, intensity, and temporal dynamics. Although large language models (LLMs) have shown promise in reasoning over textual transcriptions for emotion recognition, they typically neglect fine-grained prosodic information, limiting their effectiveness and interpretability. In this work, we propose VowelPrompt, a linguistically grounded framework that augments LLM-based emotion recognition with interpretable, fine-grained vowel-level prosodic cues. Drawing on phonetic evidence that vowels serve as primary carriers of affective prosody, VowelPrompt extracts pitch-, energy-, and duration-based descriptors from time-aligned vowel segments, and converts these features into natural language descriptions for better interpretability. Such a design enables LLMs to jointly reason over lexical semantics and fine-grained prosodic variation. Moreover, we adopt a two-stage adaptation procedure comprising supervised fine-tuning (SFT) followed by Reinforcement Learning with Verifiable Reward (RLVR), implemented via Group Relative Policy Optimization (GRPO), to enhance reasoning capability, enforce structured output adherence, and improve generalization across domains and speaker variations. Extensive evaluations across diverse benchmark datasets demonstrate that VowelPrompt consistently outperforms state-of-the-art emotion recognition methods under zero-shot, fine-tuned, cross-domain, and cross-linguistic conditions, while enabling the generation of interpretable explanations that are jointly grounded in contextual semantics and fine-grained prosodic structure.
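The RLVR stage hinges on rewards that can be checked programmatically rather than judged by a model, and GRPO normalizes each sampled completion's reward against its own group. A minimal sketch of that idea follows; the answer-tag format, reward weights (0.5 for structure, 1.0 for correctness), and sample completions are assumptions for illustration, not the paper's actual reward design.

```python
import numpy as np

def verifiable_reward(completion, gold_label):
    """Reward checkable without a learned judge: format adherence + label match."""
    r = 0.0
    if "<answer>" in completion and "</answer>" in completion:
        r += 0.5  # structured-output adherence
        pred = completion.split("<answer>")[1].split("</answer>")[0].strip().lower()
        if pred == gold_label:
            r += 1.0  # correct emotion label
    return r

def group_relative_advantages(rewards):
    """GRPO-style advantage: each completion scored relative to its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# a group of sampled completions for one utterance with gold label "angry"
group = [
    "The raised pitch suggests tension. <answer>angry</answer>",
    "<answer>sad</answer>",
    "no structured answer here",
]
rewards = [verifiable_reward(c, "angry") for c in group]
print(rewards, group_relative_advantages(rewards))
# rewards → [1.5, 0.5, 0.0]
```

Because advantages are computed within each sampled group, no separate value network is needed, which is one reason GRPO is a common choice for RLVR-style fine-tuning.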
Problem

Research questions and friction points this paper is trying to address.

speech emotion recognition
prosody
large language models
vowel-level features
multimodal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vowel-level prosody
Large Language Models
Prosodic augmentation
Reinforcement Learning with Verifiable Reward
Interpretable emotion recognition