EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

📅 2025-04-17
🤖 AI Summary
Current TTS models struggle to achieve fine-grained emotional control via natural language prompts. To address this, we propose the first LLM-driven free-text emotion prompting framework and introduce a phoneme boost design, a phoneme-audio parallel generation architecture inspired by chain-of-thought (CoT) and modality-of-thought (CoM) reasoning, to enhance semantic-acoustic alignment. We also release EmoVoice-DB, a high-quality 40-hour English speech dataset with natural language emotion annotations, enabling unconstrained emotion descriptions (e.g., "tired yet tender") without hand-crafted discrete labels. Trained solely on synthetic data, our model achieves state-of-the-art performance on the English EmoVoice-DB test set; trained with in-house data, it also significantly outperforms baselines on the Chinese Secap test set, demonstrating strong cross-lingual generalization. Evaluations by multimodal foundation models (GPT-4o-audio, Gemini) align closely with human listening assessments, validating both effectiveness and robustness across languages and evaluation paradigms.

📝 Abstract
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and modality-of-thought (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://anonymous.4open.science/r/EmoVoice-DF55. Dataset, code, and checkpoints will be released.
Problem

Research questions and friction points this paper is trying to address.

Control emotional expression in Text-to-Speech models
Enhance content consistency with phoneme boost design
Evaluate emotion metrics alignment with human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based freestyle natural language emotion control
Phoneme boost variant for content consistency
High-quality emotion dataset with natural language labels
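The phoneme boost idea, having the model emit phoneme tokens and audio tokens in parallel so the textual content anchors the acoustics, can be illustrated with a toy decoding loop. This is a minimal sketch under assumed names (`decode_parallel`, the mock `a{i}_{j}` audio tokens, the fixed audio-per-phoneme ratio); the actual EmoVoice decoder is an LLM that predicts both token streams jointly.

```python
# Toy sketch of phoneme-audio parallel generation (illustrative only; not
# the paper's implementation). Each step emits a phoneme token alongside a
# fixed number of mock audio tokens, so the phoneme stream keeps the audio
# stream faithful to the text content.

def decode_parallel(phonemes, audio_per_phoneme=2):
    """Interleave phoneme tokens with placeholder audio tokens."""
    stream = []
    for i, ph in enumerate(phonemes):
        stream.append(("phoneme", ph))
        for j in range(audio_per_phoneme):
            # A real model would sample an acoustic codec token here.
            stream.append(("audio", f"a{i}_{j}"))
    return stream

tokens = decode_parallel(["HH", "AH", "L", "OW"])  # ARPAbet for "hello"
phoneme_stream = [t for kind, t in tokens if kind == "phoneme"]
print(phoneme_stream)  # → ['HH', 'AH', 'L', 'OW']
```

Because the phoneme stream is recoverable from the joint output, the content of the utterance stays verifiable even as the acoustic tokens vary with the requested emotion.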
Authors

Guanrou Yang (Shanghai Jiao Tong University)
Chen Yang (Shanghai Jiao Tong University, Shanghai, China)
Qian Chen (Tongyi Speech Lab, Hangzhou, China)
Ziyang Ma (Shanghai Jiao Tong University, Shanghai, China)
Wenxi Chen (Shanghai Jiao Tong University, Shanghai, China)
Wen Wang (Tongyi Speech Lab, Hangzhou, China)
Tianrui Wang (Tianjin University)
Yifan Yang (Shanghai Jiao Tong University, Shanghai, China)
Zhikang Niu (Shanghai Jiao Tong University)
Wenrui Liu (Zhejiang University)
Fan Yu (Tongyi Speech Lab, Hangzhou, China)
Zhihao Du (Alibaba)
Zhifu Gao (Tongyi Speech Lab, Hangzhou, China)
ShiLiang Zhang (affiliation unknown)
Xie Chen (Shanghai Jiao Tong University, Shanghai, China)