PromptEVC: Controllable Emotional Voice Conversion with Natural Language Prompts

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing emotional voice conversion methods rely on predefined labels, reference audio, or fixed factors, limiting their ability to model individualized emotional perception and expression. This paper introduces the first natural language prompt-based emotional voice conversion framework, enabling fine-grained, personalized emotion control without explicit labels or reference speech. The method comprises two core innovations: (1) a joint emotion descriptor and prompt mapper module that explicitly learns the mapping from textual prompts to disentangled emotional representations; and (2) an end-to-end pipeline supporting mixed-emotion synthesis and continuous intensity control, integrating reference embeddings, prosody modeling, and speaker identity encoding. Experiments reported by the authors show that the approach outperforms state-of-the-art methods in emotion recognition accuracy, intensity control fidelity, naturalness of mixed-emotion synthesis, and prosodic controllability.

📝 Abstract

Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audio, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC, which uses natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose an emotion descriptor and a prompt mapper to generate fine-grained emotion embeddings, trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts the rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve speaker identity. Experimental results demonstrate that PromptEVC outperforms state-of-the-art controllable EVC methods in emotion conversion, intensity control, mixed emotion synthesis, and prosody manipulation. Speech samples are available at https://jeremychee4.github.io/PromptEVC/.
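
The abstract mentions mixed emotion synthesis and intensity control over emotion embeddings but does not say how either is computed. The sketch below is a toy illustration only, not PromptEVC's implementation: it assumes emotions are plain fixed-length vectors, mixes them by a normalized weighted average, and models intensity as linear interpolation from a neutral vector. The names `mix_emotions` and `scale_intensity`, and the two-dimensional example vectors, are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mix_emotions(embeddings: Dict[str, Vector], weights: Dict[str, float]) -> Vector:
    """Blend emotion vectors by a weighted average (weights are normalized).

    Hypothetical helper: the paper's actual mixing mechanism is not
    described in this summary.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for name, weight in weights.items():
        vec = embeddings[name]
        for i in range(dim):
            mixed[i] += (weight / total) * vec[i]
    return mixed


def scale_intensity(neutral: Vector, emotion: Vector, intensity: float) -> Vector:
    """Interpolate linearly from neutral (0.0) to the full emotion (1.0).

    Hypothetical helper: linear interpolation is an assumption made for
    illustration, not the method reported by the authors.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [n + intensity * (e - n) for n, e in zip(neutral, emotion)]


# Toy two-dimensional embeddings standing in for learned emotion vectors.
toy_embeddings = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
mostly_happy = mix_emotions(toy_embeddings, {"happy": 3.0, "sad": 1.0})
half_strength = scale_intensity([0.0, 0.0], mostly_happy, 0.5)
```

In a real system the vectors would come from learned modules such as the emotion descriptor and prompt mapper named in the abstract, and the blending would more likely be learned than hand-specified.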

Problem

Research questions and friction points this paper is trying to address.

Enables precise emotional voice conversion using natural language prompts

Addresses individual differences in emotion perception and expression

Improves naturalness via prosody modeling and speaker identity preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural language prompts for emotion control

Fine-grained emotion embeddings generation

Prosody modeling and control pipeline
