PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control

📅 2025-01-10

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

To address the insufficient fine-grained and disentangled controllability of emotion and intensity in text-to-speech (TTS), this paper proposes a prompt-driven multi-speaker speech synthesis framework. Methodologically, it introduces three key innovations: (1) emotion embedding integration, (2) an intensity-adjustable prompt mechanism, and (3) an LLM-guided end-to-end prosody control paradigm. Leveraging LLM-based prompt engineering, the framework ensures semantic fidelity during prosody generation, while cross-modal prompt alignment and multi-speaker acoustic modeling jointly enhance expressiveness. Experimental results demonstrate significant improvements in both subjective and objective evaluations: emotion naturalness achieves a MOS score of 4.21, intensity consistency is markedly enhanced, and emotion classification accuracy increases by 18.7%. Crucially, this work is the first to enable independent and precise control over both emotion type and intensity dimensions in TTS.

Technology Category

Application Category

📝 Abstract

Speech synthesis has significantly advanced from statistical methods to deep neural network architectures, leading to various text-to-speech (TTS) models that closely mimic human speech patterns. However, capturing nuances such as emotion and style in speech synthesis is challenging. To address this challenge, we introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multi-speakers. Furthermore, we leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content. Using embedding emotional cues, regulating intensity levels, and guiding prosodic variations with prompts, our approach infuses synthesized speech with human-like expressiveness and variability. Lastly, we demonstrate the effectiveness of our approach through a systematic exploration of the control mechanisms mentioned above.

Problem

Research questions and friction points this paper is trying to address.

Emotion Recognition

Speech Synthesis

Natural Language Processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotion Control

Prompt-based Approach

Large-scale Language Model

🔎 Similar Papers

No similar papers found.