🤖 AI Summary
This paper addresses high-fidelity, fine-grained controllable emotional voice conversion (EVC). The proposed framework, ClapFM-EVC, enables dual-path control via natural language prompts or reference speech, together with continuous adjustment of emotion intensity. Key contributions are: (1) EVC-CLAP, an emotion-aware contrastive language-audio pretraining model that aligns fine-grained emotional cues across the speech and text modalities; (2) FuEncoder, a fusion module with an adaptive intensity gate that blends the extracted emotional features with Phonetic PosteriorGrams (PPGs) from a pre-trained ASR model while keeping emotion intensity controllable; and (3) a conditional flow-matching model that reconstructs the Mel-spectrogram of the source speech from these fused features to improve expressiveness and naturalness. Objective and subjective evaluations demonstrate state-of-the-art performance, including a MOS of 4.12 and significant gains in emotion accuracy, naturalness, and controllability over prior methods.
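To make the cross-modal alignment step concrete, the minimal sketch below shows a standard CLAP-style symmetric contrastive objective over paired audio and text emotion embeddings. The exact EVC-CLAP loss is not specified here, and the function and tensor names (`clap_style_contrastive_loss`, `audio_emb`, `text_emb`, `temperature`) are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F


def clap_style_contrastive_loss(audio_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over paired audio/text emotion embeddings.

    audio_emb, text_emb: (batch, dim) projections from the audio and text
    encoders; row i of each tensor is assumed to describe the same utterance.
    """
    # L2-normalise so the dot product becomes a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix scaled by temperature.
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Matching pairs sit on the diagonal; penalise both retrieval directions.
    loss_audio_to_text = F.cross_entropy(logits, targets)
    loss_text_to_audio = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_audio_to_text + loss_text_to_audio)
```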
📝 Abstract
Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across the speech and text modalities. We then present FuEncoder, which uses an adaptive intensity gate to seamlessly fuse these emotional features with Phonetic PosteriorGrams (PPGs) from a pre-trained ASR model. To further improve emotional expressiveness and speech naturalness, we propose a flow matching model conditioned on the fused features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
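As a rough illustration of the fusion step, the sketch below combines frame-level PPGs with an utterance-level emotion embedding through a learned sigmoid gate, with a scalar intensity knob applied at inference time. The module name, layer sizes, and the `intensity` argument are illustrative assumptions, not the paper's actual FuEncoder architecture.

```python
import torch
import torch.nn as nn


class AdaptiveIntensityGate(nn.Module):
    """Fuse frame-level PPGs with an utterance-level emotion embedding.

    A learned sigmoid gate scales the emotion embedding before it is added
    to each PPG frame; `intensity` in [0, 1] lets the caller attenuate or
    amplify the emotional component at inference time.
    """

    def __init__(self, ppg_dim: int, emo_dim: int, hidden_dim: int):
        super().__init__()
        self.ppg_proj = nn.Linear(ppg_dim, hidden_dim)
        self.emo_proj = nn.Linear(emo_dim, hidden_dim)
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, ppg: torch.Tensor, emo: torch.Tensor,
                intensity: float = 1.0) -> torch.Tensor:
        # ppg: (batch, frames, ppg_dim); emo: (batch, emo_dim)
        h_ppg = self.ppg_proj(ppg)
        h_emo = self.emo_proj(emo).unsqueeze(1).expand_as(h_ppg)
        gate = self.gate(torch.cat([h_ppg, h_emo], dim=-1))
        # Gated, intensity-scaled emotional residual added to every frame.
        return h_ppg + intensity * gate * h_emo


# Usage sketch: fuse 80-dim PPG frames with a 256-dim emotion embedding.
fuser = AdaptiveIntensityGate(ppg_dim=80, emo_dim=256, hidden_dim=192)
fused = fuser(torch.randn(2, 120, 80), torch.randn(2, 256), intensity=0.6)
```

The fused features would then serve as the conditioning input to the flow matching decoder that reconstructs the Mel-spectrogram.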