Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback

πŸ“… 2025-08-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Diffusion-based text-to-speech (TTS) synthesis faces key bottlenecks: excessive denoising steps, weak prosody modeling, and poor real-time performance. To address these, we propose DLPO, a framework that, for the first time, explicitly incorporates the diffusion model's training loss into the reward function of reinforcement learning from human feedback (RLHF). This aligns the policy objective with the model architecture, enabling end-to-end optimization while preserving the original generative capability. DLPO is applied to WaveGrad 2, a non-autoregressive diffusion-based TTS model, with naturalness scores serving as the feedback signal. Experiments demonstrate substantial improvements: objective metrics reach UTMOS 3.65 and NISQA 4.02, and listeners prefer DLPO audio 67% of the time. The method significantly improves speech naturalness and synthesis efficiency, establishing a path toward real-time, high-fidelity TTS.
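
The core mechanism described above is a policy-gradient update whose reward combines a naturalness score with the diffusion model's own training loss. Below is a minimal, self-contained sketch of that idea on a toy denoiser; `TinyDenoiser`, `naturalness_score`, the simplified reverse-step parameterization, and the weighting `alpha` are all illustrative assumptions, not the paper's actual WaveGrad 2 setup.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for the WaveGrad 2 noise-prediction network (hypothetical)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim)
        )

    def forward(self, x_t, t):
        # Condition on the normalized diffusion step t (shape: [1, 1]).
        t_feat = t.expand(x_t.size(0), 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def naturalness_score(x0):
    # Placeholder reward; DLPO uses a learned naturalness predictor instead.
    return -x0.pow(2).mean(dim=-1)

def dlpo_update(model, opt, T=10, dim=64, batch=8, alpha=0.1):
    """One REINFORCE-style step whose reward folds in the diffusion loss."""
    betas = torch.linspace(1e-4, 0.05, T)
    x = torch.randn(batch, dim)              # start from x_T ~ N(0, I)
    log_prob = torch.zeros(batch)
    for i in reversed(range(T)):
        t = torch.full((1, 1), i / T)
        eps_hat = model(x, t)
        mean = x - betas[i] * eps_hat        # simplified reverse-step mean
        dist = torch.distributions.Normal(mean, betas[i].sqrt())
        x = dist.sample()                    # stochastic denoising "action"
        log_prob = log_prob + dist.log_prob(x).sum(dim=-1)
    # The model's own noise-prediction (training) loss, evaluated on the
    # generated sample, penalizes drift away from the pretrained generator.
    noise = torch.randn_like(x)
    t_rand = torch.rand(1, 1)
    diff_loss = (model(x + noise, t_rand) - noise).pow(2).mean(dim=-1)
    reward = naturalness_score(x) - alpha * diff_loss
    loss = -(reward.detach() * log_prob).mean()   # policy-gradient surrogate
    opt.zero_grad()
    loss.backward()
    opt.step()
    return reward.mean().item()

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
print(dlpo_update(model, opt))
```

The diffusion-loss term in the reward is what distinguishes DLPO from plain RLHF: it acts as a built-in regularizer that keeps the fine-tuned policy close to the pretrained generative distribution while the naturalness reward is maximized.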

πŸ“ Abstract
Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model's structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO's potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.
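
Put compactly, the objective sketched in the abstract pairs a naturalness reward with the model's own training loss. A plausible formalization, where the weighting coefficient $\alpha$ is an assumption (the abstract does not state the exact form):

```latex
% DLPO-style objective (sketch; alpha is an assumed weighting).
J(\theta) = \mathbb{E}_{x_{0:T} \sim p_\theta}\big[\, r(x_0) - \alpha\, \mathcal{L}_{\mathrm{diff}}(\theta; x_0) \,\big],
\qquad
\nabla_\theta J \approx \mathbb{E}\big[\big(r(x_0) - \alpha\, \mathcal{L}_{\mathrm{diff}}\big)\, \nabla_\theta \log p_\theta(x_{0:T})\big]
```

Here $r(x_0)$ is the naturalness score of the generated waveform and $\mathcal{L}_{\mathrm{diff}}$ is the diffusion (noise-prediction) loss, which anchors the policy to the pretrained generator.
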
Problem

Research questions and friction points this paper is trying to address.

Improve real-time efficiency of diffusion-based TTS models
Enhance speech quality and naturalness using RLHF
Optimize TTS performance for resource-limited settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

RLHF framework for TTS diffusion models
Integrates training loss into reward function
Uses naturalness scores as feedback (see the scoring sketch below)
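
The objective metrics cited above (UTMOS, NISQA) are reference-free MOS predictors that score generated audio directly, and the same kind of predictor can supply the naturalness feedback. A minimal scoring sketch, assuming the tarepan/SpeechMOS torch.hub entry point for UTMOS (the tag and call signature are assumptions; verify against the repository):

```python
import torch

# Load a UTMOS-style naturalness predictor via torch.hub (assumed entry
# point from the tarepan/SpeechMOS repository; check the current tag).
predictor = torch.hub.load(
    "tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True
)

wave = torch.rand(1, 16000)       # stand-in for 1 s of synthesized speech at 16 kHz
score = predictor(wave, 16000)    # predicted MOS, roughly on a 1-5 scale
print(f"UTMOS: {score.item():.2f}")
```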
πŸ”Ž Similar Papers
No similar papers found.