RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing emotional text-to-speech (TTS) approaches rely on manual emotion annotations or indirect optimization objectives, and they struggle to ensure emotional naturalness and semantic fidelity at the same time. To address this, the paper proposes RLAIF-SPA, a Reinforcement Learning from AI Feedback framework built on two automatic reward signals: semantic accuracy, judged with an ASR model, and prosodic-emotional label alignment, judged with a large language model along four fine-grained dimensions (Structure, Emotion, Speed, and Tone). This removes the need for manual emotion labeling. Evaluated on LibriSpeech, RLAIF-SPA reduces word error rate (WER) by 26.1% and improves objective similarity (SIM-O) by 9.1% compared with Chat-TTS, and human subjective ratings increase by over 10%, indicating gains in both emotional expressiveness and speech quality.

📝 Abstract
Text-to-speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, which incorporates a Reinforcement Learning from AI Feedback (RLAIF) mechanism. The mechanism employs Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques to judge semantic accuracy and prosodic-emotional label alignment, respectively, and uses these judgments as a direct reward for optimizing emotional expressiveness and intelligibility. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.
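For readers unfamiliar with the WER metric cited in the abstract, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between a reference text and an ASR transcript, divided by the reference length. The function names and the lowercase/whitespace normalization are illustrative assumptions, not the paper's evaluation code, and the abstract does not state whether the 26.1% figure is a relative or an absolute reduction; `relative_reduction` shows only the relative case.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Normalization here (lowercasing, whitespace split) is an assumption;
    real evaluations usually also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline
```

In an RLAIF-style setup, `hypothesis` would be the ASR transcript of the synthesized audio and `reference` the input text, so a lower value means more intelligible speech.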
Problem

Research questions and friction points this paper is trying to address.

Enhancing emotional expressiveness in text-to-speech synthesis
Reducing reliance on costly emotion annotations for speech generation
Improving semantic accuracy and prosodic alignment in emotional speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses RLAIF mechanism for emotional speech optimization
Leverages prosodic label alignment across four dimensions
Incorporates semantic accuracy feedback for intelligibility
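The two feedback signals listed above can be illustrated with a small sketch. It assumes that an LLM judge returns one label per dimension (Structure, Emotion, Speed, Tone) and that an ASR-based WER is already available. The label vocabulary, the match-fraction scoring, the `1 - WER` mapping, and the weight `alpha` are all hypothetical choices made for illustration; the summary does not specify the paper's actual reward formula.

```python
# The four dimension names come from the abstract; everything else is assumed.
DIMENSIONS = ("structure", "emotion", "speed", "tone")


def label_alignment(target: dict, judged: dict) -> float:
    """Fraction of the four dimensions where the judged label matches the target."""
    matches = sum(1 for dim in DIMENSIONS if target[dim] == judged.get(dim))
    return matches / len(DIMENSIONS)


def semantic_accuracy(wer: float) -> float:
    """Map WER to a score in [0, 1]; WER above 1 is clipped to a score of 0."""
    return max(0.0, 1.0 - wer)


def combined_reward(target: dict, judged: dict, wer: float, alpha: float = 0.5) -> float:
    """Hypothetical linear mix of prosodic alignment and semantic accuracy."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * label_alignment(target, judged) + (1.0 - alpha) * semantic_accuracy(wer)


if __name__ == "__main__":
    # Illustrative labels only; the paper's label set is not given in this summary.
    target = {"structure": "declarative", "emotion": "happy", "speed": "fast", "tone": "high"}
    judged = {"structure": "declarative", "emotion": "sad", "speed": "fast", "tone": "high"}
    print(combined_reward(target, judged, wer=0.2))
```

A scalar of this kind could serve as the reward in a policy-optimization loop; how the paper weights or normalizes the two terms is not stated here, so the linear mix should be read as a placeholder.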
Qing Yang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Zhenghao Liu
Northeastern University
NLP, Information Retrieval
Junxin Wang
School of Computer Science and Engineering, Northeastern University, Shenyang, China
Yangfan Du
Northeastern University, China
speech language processing
Pengcheng Huang
Computer Engineering Group, ETH Zurich
Intelligent Learning Systems, Cyber Physical Systems
Tong Xiao
School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China