PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

📅 2025-01-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the insufficient fine-grained and disentangled controllability of emotion and intensity in text-to-speech (TTS), this paper proposes a prompt-driven multi-speaker speech synthesis framework. Methodologically, it introduces three key innovations: (1) emotion embedding integration, (2) an intensity-adjustable prompt mechanism, and (3) an LLM-guided end-to-end prosody control paradigm. Leveraging LLM-based prompt engineering, the framework ensures semantic fidelity during prosody generation, while cross-modal prompt alignment and multi-speaker acoustic modeling jointly enhance expressiveness. Experimental results demonstrate significant improvements in both subjective and objective evaluations: emotion naturalness achieves a MOS score of 4.21, intensity consistency is markedly enhanced, and emotion classification accuracy increases by 18.7%. Crucially, this work is the first to enable independent and precise control over both emotion type and intensity dimensions in TTS.

📝 Abstract
Speech synthesis has advanced significantly from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis remains challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multiple speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. By embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of these control mechanisms.
Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition
Speech Synthesis
Natural Language Processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Control
Prompt-based Approach
Large-scale Language Model
Shaozuo Zhang
Singapore University of Technology and Design, Singapore
Ambuj Mehrish
Research Fellow, Singapore University of Technology and Design, Singapore
Signal Processing, Multimedia Forensics, Speech and Language Processing, Deep Learning
Yingting Li
Beijing University of Posts and Telecommunications, China
Soujanya Poria
Singapore University of Technology and Design, Singapore