ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech

📅 2025-05-20
🤖 AI Summary
This paper addresses high-fidelity, fine-grained controllable emotional voice conversion (EVC). It proposes ClapFM-EVC, a method enabling dual-path control (via natural language prompts or reference speech) with continuous adjustment of emotional intensity. Key contributions: (1) EVC-CLAP, presented as the first emotion-aware contrastive language-audio pretraining model, which strengthens cross-modal emotional alignment; (2) FuEncoder, a fusion module with an adaptive intensity gate that yields a disentangled, controllable representation of emotion intensity; and (3) an end-to-end flow-matching reconstruction framework conditioned on the fused features and Phonetic PosteriorGrams from a pre-trained ASR model. Objective and subjective evaluations show state-of-the-art performance, including a MOS of 4.12 and significant gains in emotion accuracy, naturalness, and controllability over prior methods.
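As a rough illustration of the flow-matching reconstruction mentioned above, the sketch below implements a generic optimal-transport conditional flow matching (CFM) training objective. The network, feature dimensions, and conditioning layout are illustrative assumptions, not the paper's actual decoder.

```python
import torch
import torch.nn as nn

class CFMDecoder(nn.Module):
    """Toy velocity-field network; a stand-in for a flow-matching Mel decoder.

    Dimensions (80-bin Mel, 256-dim fused condition) are assumptions."""
    def __init__(self, mel_dim=80, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: noisy Mel frame, t: (batch, 1) time scalar, cond: fused features
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """Optimal-transport CFM loss: interpolate x_t = (1 - t) * x0 + t * x1
    with x0 ~ N(0, I) and regress onto the constant target velocity x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```

At inference time the learned velocity field would be integrated from Gaussian noise to a Mel-spectrogram with a few ODE solver steps (e.g. Euler), conditioned on the fused emotion/content features.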

📝 Abstract
Despite great advances, achieving high-fidelity emotional voice conversion (EVC) with flexible and interpretable control remains challenging. This paper introduces ClapFM-EVC, a novel EVC framework capable of generating high-quality converted speech driven by natural language prompts or reference speech with adjustable emotion intensity. We first propose EVC-CLAP, an emotional contrastive language-audio pre-training model, guided by natural language prompts and categorical labels, to extract and align fine-grained emotional elements across speech and text modalities. Then, a FuEncoder with an adaptive intensity gate is presented to seamlessly fuse emotional features with Phonetic PosteriorGrams from a pre-trained ASR model. To further improve emotion expressiveness and speech naturalness, we propose a flow matching model conditioned on these captured features to reconstruct the Mel-spectrogram of the source speech. Subjective and objective evaluations validate the effectiveness of ClapFM-EVC.
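The contrastive language-audio pre-training behind EVC-CLAP follows the general CLIP/CLAP recipe: pull matched audio-text embedding pairs together and push mismatched pairs apart with a symmetric InfoNCE loss. The sketch below shows that generic objective only; the paper's additional guidance from categorical emotion labels is omitted, and all names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings.

    Matched pairs sit on the diagonal of the similarity matrix; the loss
    classifies, per row and per column, which entry is the true pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(a.shape[0])      # i-th audio matches i-th text
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_a2t + loss_t2a)
```

After such pretraining, either a text prompt or a reference utterance can be mapped into the same emotion embedding space, which is what makes dual-path control possible.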
Problem

Research questions and friction points this paper is trying to address.

Achieving high-fidelity emotional voice conversion with flexible control
Aligning emotional elements across speech and text modalities
Improving emotion expressiveness and speech naturalness in conversion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Emotional contrastive pre-training for text-speech alignment
Adaptive intensity gate for feature fusion
Flow matching model for spectrogram reconstruction
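One plausible reading of the "adaptive intensity gate" bullet is a learned sigmoid gate that scales the emotion embedding per frame before fusing it with the PPG content features. The module below is a minimal sketch under that assumption; the dimensions, the user-facing `intensity` scalar, and the fusion-by-concatenation design are all hypothetical, not the paper's FuEncoder.

```python
import torch
import torch.nn as nn

class IntensityGatedFusion(nn.Module):
    """Illustrative adaptive intensity gate: a sigmoid gate, computed from
    content (PPG) and emotion features jointly, modulates how strongly the
    emotion embedding is injected into each frame before projection."""
    def __init__(self, ppg_dim=144, emo_dim=256, out_dim=256):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(ppg_dim + emo_dim, emo_dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(ppg_dim + emo_dim, out_dim)

    def forward(self, ppg, emo, intensity=1.0):
        # ppg: (batch, frames, ppg_dim); emo: (batch, emo_dim)
        emo = emo.unsqueeze(1).expand(-1, ppg.shape[1], -1)
        g = self.gate(torch.cat([ppg, emo], dim=-1))   # per-frame gate in (0, 1)
        fused = torch.cat([ppg, intensity * g * emo], dim=-1)
        return self.proj(fused)
```

Setting `intensity=0.0` removes the emotional contribution entirely, while values between 0 and 1 give the continuous intensity control described in the summary.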
Yu Pan
Department of Information Science and Technology, Kyushu University, Japan
Yanni Hu
EverestAI, Ximalaya Inc., China
Yuguang Yang
Microsoft, Amazon Alexa AI, Tsinghua University, Johns Hopkins University
Jixun Yao
EverestAI, Ximalaya Inc., China
Jianhao Ye
EverestAI, Ximalaya Inc., China
Hongbin Zhou
Shanghai AI Laboratory
Lei Ma
Department of Computer Science, The University of Tokyo, Japan
Jianjun Zhao
Kyushu University