PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

📅 2025-05-27
🤖 AI Summary
Existing emotional voice conversion methods rely on predefined labels, reference audio, or fixed factors, limiting their ability to model individualized emotional perception and expression. This paper introduces the first natural language prompt-based emotional voice conversion framework, enabling fine-grained, personalized emotion control without explicit labels or reference speech. The method comprises two core innovations: (1) a joint emotion descriptor and prompt mapper module that explicitly learns the mapping from textual prompts to disentangled emotional representations; and (2) an end-to-end pipeline supporting mixed-emotion synthesis and continuous intensity control, integrating reference embeddings, prosody modeling, and speaker identity encoding. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods in emotion recognition accuracy, intensity control fidelity, naturalness of mixed-emotion synthesis, and prosodic controllability.

📝 Abstract
Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audio, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC, which uses natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose an emotion descriptor and a prompt mapper that generate fine-grained emotion embeddings and are trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts the rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve speaker identity. Experimental results demonstrate that PromptEVC outperforms state-of-the-art controllable EVC methods in emotion conversion, intensity control, mixed emotion synthesis, and prosody manipulation. Speech samples are available at https://jeremychee4.github.io/PromptEVC/.
Problem

Research questions and friction points this paper is trying to address.

Enables precise emotional voice conversion using natural language prompts
Addresses individual differences in emotion perception and expression
Improves naturalness via prosody modeling and speaker identity preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural language prompts for emotion control
Fine-grained emotion embeddings generation
Prosody modeling and control pipeline
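
To make "mixed emotion synthesis" and "intensity control" more concrete, here is a minimal sketch of how such controls can be expressed in an embedding space. It is an illustration under stated assumptions, not PromptEVC's actual method: the listing only says that a prompt mapper generates fine-grained emotion embeddings from text, and does not describe the arithmetic. The function `blend_emotions`, the prototype vectors, and the neutral embedding are all hypothetical.

```python
# Hypothetical illustration only: PromptEVC derives embeddings from
# natural language prompts; this sketch hand-picks vectors instead.

def blend_emotions(prototypes, weights, intensity, neutral):
    """Mix emotion prototype vectors, then scale the result by intensity.

    prototypes: dict mapping emotion name -> list of floats (assumed vectors)
    weights:    dict mapping emotion name -> non-negative mixing weight
    intensity:  float, clamped to [0, 1]; 0 gives neutral, 1 the full mix
    neutral:    list of floats, the assumed neutral-emotion embedding
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Weighted average of the selected prototypes (weights normalized to 1).
    mixed = [0.0] * len(neutral)
    for name, weight in weights.items():
        for i, value in enumerate(prototypes[name]):
            mixed[i] += (weight / total) * value

    # Linear interpolation between the neutral and the mixed embedding.
    alpha = min(max(intensity, 0.0), 1.0)
    return [(1.0 - alpha) * n + alpha * m for n, m in zip(neutral, mixed)]


if __name__ == "__main__":
    prototypes = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
    neutral = [0.0, 0.0]
    # Mostly happy with a trace of sadness, at half intensity.
    print(blend_emotions(prototypes, {"happy": 3, "sad": 1}, 0.5, neutral))
```

In a prompt-driven system the weights and intensity would not be given as numbers; a prompt such as "slightly cheerful but a little wistful" would be mapped to an embedding directly, which is the role the listing attributes to the prompt mapper.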
Authors

Tianhua Qi
Southeast University
affective computing, speech processing

Shiyan Wang
Key Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education, Nanjing 210096, China; School of Biological Science and Medical Engineering, Southeast University, China

Cheng Lu
Key Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education, Nanjing 210096, China; School of Biological Science and Medical Engineering, Southeast University, China

Tengfei Song
Huawei
Emotion recognition, Computer vision, Graph neural network

Hao Yang
Huawei Translation Service Center, China

Zhanglin Wu
2012 Lab, Huawei Co. LTD
Machine Translation, Natural Language Processing

Wenming Zheng
Southeast University
Affective Computing, Pattern Recognition, Computer Vision