RRPO: Robust Reward Policy Optimization for LLM-based Emotional TTS

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
In large language model–driven emotional text-to-speech (TTS), differentiable reinforcement learning (e.g., DiffRO) is vulnerable to reward hacking—where the policy network generates acoustic artifacts to artificially inflate rewards, degrading perceptual speech quality. Method: We propose a robust reward optimization framework built upon differentiable RL, incorporating a hybrid regularization mechanism comprising perceptual consistency constraints and gradient penalties to enhance alignment between the reward model and human subjective judgments while suppressing spurious reward maximization. The method is validated across languages and rigorously evaluated via subjective listening tests and ablation studies. Contribution/Results: Our approach significantly outperforms baselines in both emotional expressiveness and speech naturalness, effectively mitigates reward hacking, and demonstrates strong cross-lingual generalization capability.
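The hybrid regularization described above can be illustrated with a toy sketch (not the paper's implementation): a linear reward model whose training loss combines a gradient penalty, discouraging the sharp reward gradients a policy could exploit with acoustic artifacts, and a perceptual-consistency term that forces near-identical rewards for a sample and a perceptually equivalent perturbed copy. The function names, the 1-Lipschitz gradient target, and the penalty weights are all illustrative assumptions.

```python
import numpy as np

def reward(w, x):
    # Toy linear reward model R(x) = w . x (stand-in for a learned RM).
    return float(w @ x)

def hybrid_regularizer(w, x, x_equiv, lam_gp=10.0, lam_pc=1.0):
    # Gradient penalty: for a linear model, grad_x R = w everywhere,
    # so we can use the analytic gradient norm. Penalizing its deviation
    # from 1 (a WGAN-GP-style target, assumed here) caps how steeply the
    # reward can spike in response to small input artifacts.
    grad_norm = np.linalg.norm(w)
    gp = (grad_norm - 1.0) ** 2
    # Perceptual consistency: two perceptually equivalent inputs should
    # receive (nearly) the same reward.
    pc = (reward(w, x) - reward(w, x_equiv)) ** 2
    return lam_gp * gp + lam_pc * pc

x = np.array([0.5, -0.2, 0.1])
x_equiv = x + 0.01                   # "perceptually equivalent" variant
w_unit = np.array([0.6, 0.8, 0.0])   # ||w|| = 1: no gradient penalty
w_spiky = 5.0 * w_unit               # inflated gradients: heavily penalized
```

With `w_unit` the regularizer is near zero, while the spiky model pays a large gradient penalty, mimicking how the regularized RM resists reward hacking.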

📝 Abstract
Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) by generating acoustic artifacts to achieve spurious rewards, but at the cost of degrading perceptual quality. To address this, we propose Robust Reward Policy Optimization (RRPO), a novel framework that employs a hybrid regularization scheme. This scheme develops a robust RM whose reward signal is more reliably aligned with human perception, compelling the policy to abandon detrimental shortcuts and instead learn the complex features of genuine emotions. Our ablation study confirms the enhanced robustness of our RM, as evidenced by its strong cross-lingual generalization. The subjective evaluation demonstrates that this robust RM effectively mitigates reward hacking, leading to significant improvements in both emotional expressiveness and naturalness over all baselines. Demo page: https://lrwinr.github.io/RRPO-CosyVoice.
Problem

Research questions and friction points this paper is trying to address.

Prevent reward hacking in RL-based emotional TTS
Align reward signals with human perception for emotions
Improve emotional expressiveness and naturalness in TTS
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid regularization scheme for robust reward model
Cross-lingual generalization of robust reward signal
Mitigates reward hacking to improve emotional expressiveness
Cong Wang
Beijing University of Posts and Telecommunications, Beijing, China
Changfeng Gao
Speech Team, Tongyi Lab, Alibaba Group
Yang Xiang
Speech Team, Tongyi Lab, Alibaba Group
Zhihao Du
Alibaba
Speech separation · speech enhancement · speaker diarization
Keyu An
Speech Team, Tongyi Lab, Alibaba Group
Han Zhao
Speech Team, Tongyi Lab, Alibaba Group
Qian Chen
Speech Team, Tongyi Lab, Alibaba Group
Xiangang Li
Unknown affiliation
Speech recognition · natural language processing
Yingming Gao
Beijing University of Posts and Telecommunications
Computer Assisted Language Learning · Acoustic Phonetics and Speech Synthesis
Ya Li
Beijing University of Posts and Telecommunications, Beijing, China