EMORL-TTS: Reinforcement Learning for Fine-Grained Emotion Control in LLM-based TTS

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM-based TTS systems rely on discrete speech tokens, which limits fine-grained emotional control: emotion is often reduced to coarse categorical labels, and joint modeling of intensity and local emphasis remains unaddressed. This paper proposes EMORL-TTS, a framework that jointly models global emotional intensity in a continuous Valence-Arousal-Dominance (VAD) space and local emphasis positions. It integrates supervised fine-tuning with task-driven reinforcement learning to enable zero-shot, multi-dimensional emotional controllability. Key innovations include: (1) continuous VAD-guided intensity modulation; (2) an emphasis-aware reward function for reinforcement learning; and (3) a lightweight adaptation strategy compatible with mainstream LLM-TTS architectures. Experiments show improvements in emotion accuracy, intensity differentiation, and emphasis clarity, while maintaining synthesis quality comparable to strong LLM-based baselines.
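
To make the idea of continuous intensity in VAD space concrete, here is a minimal sketch that treats intensity as a position on the straight line between a neutral VAD point and an emotion's VAD anchor. The linear interpolation, the function name `scale_intensity`, and the coordinates in the usage note are illustrative assumptions; the summary does not say how EMORL-TTS maps intensity to VAD values.

```python
from typing import Tuple

# (valence, arousal, dominance); the value range is an assumption of this sketch.
VAD = Tuple[float, float, float]


def scale_intensity(neutral: VAD, anchor: VAD, intensity: float) -> VAD:
    """Linearly interpolate from `neutral` (intensity 0) to `anchor` (intensity 1).

    Hypothetical helper for illustration only, not the paper's method.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity is assumed to lie in [0, 1]")
    v, a, d = (n + intensity * (t - n) for n, t in zip(neutral, anchor))
    return (v, a, d)
```

For example, with a neutral point at `(0.0, 0.0, 0.0)` and a placeholder anchor at `(1.0, -1.0, 0.5)`, `scale_intensity(neutral, anchor, 0.5)` returns the midpoint `(0.5, -0.5, 0.25)`.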

📝 Abstract
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. We also investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
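
The abstract names three task-specific rewards (emotion category, intensity, emphasis) but does not say how they are combined. One common choice, shown here purely as an assumption, is a weighted average of per-sample scores. The names `RewardWeights` and `combined_reward`, the equal default weights, and the `[0, 1]` score range are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the abstract does not state how the terms are balanced.
    category: float = 1.0
    intensity: float = 1.0
    emphasis: float = 1.0


def combined_reward(category_score: float,
                    intensity_score: float,
                    emphasis_score: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Weighted average of three per-sample scores, each assumed to lie in [0, 1]."""
    for score in (category_score, intensity_score, emphasis_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    total = weights.category + weights.intensity + weights.emphasis
    return (weights.category * category_score
            + weights.intensity * intensity_score
            + weights.emphasis * emphasis_score) / total
```

Normalizing by the sum of the weights keeps the combined value in the same `[0, 1]` range as its inputs, so changing one weight rescales the trade-off without changing the overall reward scale.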

Problem

Research questions and friction points this paper is trying to address.

Achieving fine-grained emotional control in LLM-based TTS systems
Overcoming limitations of discrete speech tokens for emotion modulation
Unifying global intensity control with local emphasis regulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses reinforcement learning for emotion control
Combines supervised fine-tuning with RL rewards
Unifies global intensity and local emphasis regulation
Haoxun Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yu Liu
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Yuqing Sun
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Hanlei Shi
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Leyuan Qu
Hangzhou Institute for Advanced Study, UCAS
Speech Representation Learning; Multi-modal Learning and Affective Computing
Taihao Li
Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences