🤖 AI Summary
Existing emotional voice conversion methods rely on predefined labels, reference audio, or fixed factor values, limiting their ability to model individualized emotional perception and expression. This paper introduces the first natural language prompt-based emotional voice conversion framework, enabling fine-grained, personalized emotion control without explicit labels or reference speech. The method comprises two core innovations: (1) a joint emotion descriptor and prompt mapper module that explicitly learns the mapping from textual prompts to disentangled emotional representations; and (2) an end-to-end pipeline supporting mixed-emotion synthesis and continuous intensity control, integrating reference embeddings, prosody modeling, and speaker identity encoding. Extensive experiments demonstrate that the approach significantly outperforms state-of-the-art methods in emotion recognition accuracy, intensity control fidelity, naturalness of mixed-emotion synthesis, and prosodic controllability.
📝 Abstract
Controllable emotional voice conversion (EVC) aims to manipulate emotional expressions to increase the diversity of synthesized speech. Existing methods typically rely on predefined labels, reference audio, or prespecified factor values, often overlooking individual differences in emotion perception and expression. In this paper, we introduce PromptEVC, which utilizes natural language prompts for precise and flexible emotion control. To bridge text descriptions with emotional speech, we propose an emotion descriptor and a prompt mapper that generate fine-grained emotion embeddings, trained jointly with reference embeddings. To enhance naturalness, we present a prosody modeling and control pipeline that adjusts rhythm based on linguistic content and emotional cues. Additionally, a speaker encoder is incorporated to preserve identity. Experimental results demonstrate that PromptEVC outperforms state-of-the-art controllable EVC methods in emotion conversion, intensity control, mixed emotion synthesis, and prosody manipulation. Speech samples are available at https://jeremychee4.github.io/PromptEVC/.
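The two-stage prompt-to-embedding design described in the abstract (an emotion descriptor feeding a prompt mapper) can be sketched at a very high level as follows. This is a hypothetical, minimal NumPy illustration, not the paper's actual implementation: the class names, dimensions, and the mean-pooling/softmax choices are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

class EmotionDescriptor:
    """Hypothetical stand-in: maps a tokenized prompt to a coarse
    distribution over basic emotion categories."""
    def __init__(self, vocab_size, n_emotions):
        self.W = rng.normal(scale=0.1, size=(vocab_size, n_emotions))

    def __call__(self, token_ids):
        logits = self.W[token_ids].mean(axis=0)   # pool per-token scores
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                    # soft emotion distribution

class PromptMapper:
    """Hypothetical stand-in: maps the coarse emotion distribution to a
    fine-grained continuous emotion embedding for the synthesis pipeline."""
    def __init__(self, n_emotions, embed_dim):
        self.W = rng.normal(scale=0.1, size=(n_emotions, embed_dim))

    def __call__(self, emotion_dist):
        return emotion_dist @ self.W              # fine-grained embedding

# Toy usage: token ids standing in for a prompt like "slightly happy".
descriptor = EmotionDescriptor(vocab_size=100, n_emotions=5)
mapper = PromptMapper(n_emotions=5, embed_dim=16)
prompt_tokens = np.array([3, 17, 42])
emotion_embedding = mapper(descriptor(prompt_tokens))
print(emotion_embedding.shape)  # (16,)
```

In the paper's framework these components are trained jointly with reference embeddings so that prompt-derived embeddings align with those extracted from real emotional speech; the soft distribution stage is also what would allow mixed emotions and continuous intensity to be expressed by blending category weights.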