EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

📅 2025-10-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current LLM-based TTS systems rely on discrete speech tokens, which limits fine-grained emotional control: emotion is often reduced to coarse categorical labels, and joint modeling of emotional intensity and local prosodic emphasis remains unaddressed. This paper proposes EMORL-TTS, the first framework to jointly model global emotional intensity and local emphasis positions within a continuous Valence-Arousal-Dominance (VAD) affective space. It integrates supervised fine-tuning with task-driven reinforcement learning to enable zero-shot, multi-dimensional emotional controllability. Key innovations include: (1) continuous VAD-guided intensity modulation; (2) an emphasis-aware reward function for reinforcement learning; and (3) a lightweight adaptation strategy compatible with mainstream LLM-TTS architectures. Experiments show improvements in emotion accuracy, intensity discrimination, and emphasis clarity, while maintaining naturalness comparable to state-of-the-art methods.
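To make the idea of continuous intensity control in VAD space concrete, here is a minimal Python sketch. It assumes intensity can be pictured as a linear interpolation between a neutral point and an emotion-specific anchor. That interpolation scheme, the 0 to 1 axis range, the `VAD` class, the `scale_intensity` helper, and the anchor values are all illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VAD:
    """A point in Valence-Arousal-Dominance space (assumed 0..1 per axis)."""
    valence: float
    arousal: float
    dominance: float


# Hypothetical coordinates for illustration only; not taken from the paper.
NEUTRAL = VAD(0.5, 0.5, 0.5)
HAPPY_ANCHOR = VAD(0.9, 0.7, 0.6)


def scale_intensity(anchor: VAD, intensity: float, neutral: VAD = NEUTRAL) -> VAD:
    """Linearly interpolate from `neutral` (intensity 0.0) to `anchor` (intensity 1.0)."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")

    def lerp(start: float, end: float) -> float:
        return start + intensity * (end - start)

    return VAD(
        lerp(neutral.valence, anchor.valence),
        lerp(neutral.arousal, anchor.arousal),
        lerp(neutral.dominance, anchor.dominance),
    )


# Two intensity levels of the same (assumed) emotion anchor.
mild_happy = scale_intensity(HAPPY_ANCHOR, 0.3)
strong_happy = scale_intensity(HAPPY_ANCHOR, 0.9)
```

Under these assumptions, a stronger intensity simply moves the target further from the neutral point along each VAD axis, which is one way a continuous space can express gradations that a categorical label cannot.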

📝 Abstract

Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
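The abstract names three task-specific rewards (emotion category, intensity, and emphasis) but does not say how they are combined. The Python sketch below shows one plausible arrangement, a weighted sum of three scalar terms. The weighted-sum form, the `RewardWeights` class, the scoring rule inside each term, and all function names are assumptions made for illustration, not the paper's formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; the source does not specify any."""
    category: float = 1.0
    intensity: float = 1.0
    emphasis: float = 1.0


def category_reward(predicted: str, target: str) -> float:
    """1.0 when the recognized emotion label matches the requested one."""
    return 1.0 if predicted == target else 0.0


def intensity_reward(predicted: float, target: float) -> float:
    """Closer intensities (on an assumed 0..1 scale) score higher, floored at 0."""
    return max(0.0, 1.0 - abs(predicted - target))


def emphasis_reward(predicted_index: int, target_index: int) -> float:
    """1.0 when the detected emphasized word position matches the requested one."""
    return 1.0 if predicted_index == target_index else 0.0


def total_reward(
    predicted_label: str,
    target_label: str,
    predicted_intensity: float,
    target_intensity: float,
    predicted_emphasis: int,
    target_emphasis: int,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the three task rewards (an assumed combination rule)."""
    return (
        weights.category * category_reward(predicted_label, target_label)
        + weights.intensity * intensity_reward(predicted_intensity, target_intensity)
        + weights.emphasis * emphasis_reward(predicted_emphasis, target_emphasis)
    )


# Example: wrong emotion label, intensity off by 0.25, emphasis on the right word.
example = total_reward("sad", "happy", 0.75, 0.5, 2, 2)
```

In a setup like this, the predicted label, intensity, and emphasis position would come from separate evaluators run on the synthesized audio; those evaluators are outside the scope of this sketch.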

Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained emotional control in LLM-based TTS systems

Overcoming limitations of discrete speech tokens for emotion modulation

Unifying global intensity control with local emphasis regulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for emotion control

Combines supervised fine-tuning with RL rewards

Unifies global intensity and local emphasis regulation
