ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

📅 2025-05-20

🤖 AI Summary
This paper addresses high-fidelity, fine-grained controllable emotional voice conversion. It proposes a method that enables dual-path control, via natural language prompts or reference speech, together with continuous adjustment of emotional intensity. The key contributions are: (1) EVC-CLAP, described as the first emotion-aware contrastive language-audio pretraining model, which enhances cross-modal emotional alignment; (2) FuEncoder, a speaker- and emotion-encoding module with adaptive intensity gating, enabling disentangled and controllable emotion intensity representation; and (3) a flow-matching-based reconstruction framework conditioned on the fused emotional features and on Phonetic PosteriorGrams (PPGs) extracted by a pre-trained ASR model. Objective and subjective evaluations report state-of-the-art performance: a MOS of 4.12, with significant improvements in emotion accuracy, naturalness, and controllability over prior methods.
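To make the idea of "adaptive intensity gating" more concrete, here is a minimal NumPy sketch of one way a gate could scale how much emotion is mixed into frame-level content features. It is not the paper's FuEncoder: the function name `gated_emotion_fusion`, the sigmoid gate, the residual form, and the parameters `w` and `b` are all assumptions made for illustration.

```python
import numpy as np

def gated_emotion_fusion(content, emotion, intensity, w, b):
    """Hypothetical gated fusion (illustrative, not the paper's FuEncoder).

    content:   (T, D) frame-level content features, e.g. projected PPGs
    emotion:   (D,)   utterance-level emotion embedding
    intensity: scalar in [0, 1] chosen by the user at inference time
    w, b:      (2D, D) weight and (D,) bias of an assumed linear gate
    """
    # Repeat the emotion embedding for every frame.
    emo = np.broadcast_to(emotion, content.shape)
    # The gate sees both streams and outputs a value in (0, 1) per frame and channel.
    z = np.concatenate([content, emo], axis=-1) @ w + b
    gate = 1.0 / (1.0 + np.exp(-z))
    # Residual mix: intensity 0 leaves the content features untouched.
    return content + intensity * gate * emo

rng = np.random.default_rng(0)
T, D = 5, 4
content = rng.standard_normal((T, D))
emotion = rng.standard_normal(D)
w = 0.1 * rng.standard_normal((2 * D, D))
b = np.zeros(D)

neutral = gated_emotion_fusion(content, emotion, 0.0, w, b)
strong = gated_emotion_fusion(content, emotion, 1.0, w, b)
```

In this sketch the emotional offset grows linearly with `intensity`, which is one simple way to obtain continuous control; a learned gate in a real system may behave differently.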

📝 Abstract
Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech, with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
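The contrastive language-audio alignment mentioned in the abstract can be illustrated with the symmetric contrastive loss commonly used in CLAP-style models. The sketch below is an assumption about the general technique only: it does not reproduce EVC-CLAP's guidance from categorical labels, and the function names and the `temperature` value are illustrative.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Generic CLAP-style loss over a batch of paired (audio, text) embeddings.

    audio_emb, text_emb: (B, D) arrays where row i of each is a matching pair.
    """
    # Cosine similarity via L2-normalised embeddings.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature  # (B, B); matching pairs lie on the diagonal
    idx = np.arange(len(a))
    # Cross-entropy in both retrieval directions.
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_a2t + loss_t2a))

rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 16))
text = audio + 0.1 * rng.standard_normal((8, 16))  # nearly aligned pairs

matched = symmetric_contrastive_loss(audio, text)
mismatched = symmetric_contrastive_loss(audio, np.roll(text, 1, axis=0))
```

Here `matched` comes out lower than `mismatched`, because rolling the text rows breaks the pairing that the loss rewards.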

Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity emotional voice conversion with flexible control
Aligning emotional elements across speech and text modalities
Improving emotion expressiveness and speech naturalness in conversion

Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotional contrastive pre-training for text-speech alignment
Adaptive intensity gate for feature fusion
Flow matching model for spectrogram reconstruction
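
As a rough illustration of the flow matching component listed above, the following NumPy sketch builds training pairs for a conditional flow matching objective with a straight-line probability path, and integrates a velocity field with Euler steps at inference time. This is a commonly used formulation, assumed here for illustration; the paper's exact path, conditioning, and solver are not stated in this listing, and all names (`cfm_training_pair`, `cfm_loss`, `euler_sample`, `sigma_min`) are hypothetical.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Sample (x_t, t, target velocity) for a batch of target Mel-spectrograms.

    x1: (B, n_mels, T) target Mel-spectrograms.
    """
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise sample
    # One time value per batch item, broadcastable over the remaining axes.
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    # Straight-line path from noise (t = 0) towards data (t = 1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the network should predict at (x_t, t).
    target = x1 - (1.0 - sigma_min) * x0
    return x_t, t, target

def cfm_loss(predicted_velocity, target):
    # Mean squared error between predicted and target velocities.
    return float(np.mean((predicted_velocity - target) ** 2))

def euler_sample(velocity_fn, x0, cond, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, cond) from t = 0 to t = 1."""
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt, cond)
    return x

rng = np.random.default_rng(0)
mel = rng.standard_normal((2, 80, 12))  # stand-in for real Mel-spectrograms
x_t, t, target = cfm_training_pair(mel, rng)
```

In a real system `velocity_fn` would be a neural network that receives the content and emotion features as `cond`; here it is left as a plain callable so the sketch stays self-contained.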