Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

📅 2025-02-08
🤖 AI Summary
To address insufficient prosody and emotion modeling in voice conversion, this paper proposes PFlow-VC, an expressive framework that converts timbre and style together. It introduces (1) a masked pitch-conditioned flow matching model driven jointly by discrete pitch tokens and target-speaker prompts; (2) a self-supervised pitch VQ-VAE, pretrained to learn discrete pitch representations; and (3) a combination of global timbre embeddings with time-varying timbre tokens to improve timbre similarity. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD show that the model outperforms the compared methods in both timbre conversion and style transfer.

📝 Abstract
This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC work has focused primarily on speaker conversion, leaving expressiveness (such as prosody and emotion) in timbre conversion underexplored. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQ-VAE model to discretize speaker-irrelevant pitch information, and we use a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis. This gives the speaker conversion model in-context pitch modeling capabilities and improves its capacity for voice style transfer. Additionally, we improve timbre similarity by combining global timbre embeddings with time-varying timbre tokens. Experiments on the unseen LibriTTS test-clean set and the emotional speech dataset ESD show the superiority of the PFlow-VC model in both timbre conversion and style transfer. Audio samples are available on the demo page https://speechai-demo.github.io/PFlow-VC/.
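The abstract does not spell out the training objective, so the following is only a minimal sketch of a generic conditional flow matching step, assuming the common optimal-transport probability path and a loss restricted to masked frames. The function names, the `sigma_min` parameter, and the masking convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one conditional flow matching training example.

    x1: target Mel-spectrogram, shape (frames, mel_bins).
    Returns (t, x_t, u_t): the sampled time, the noisy interpolant,
    and the velocity the network would be trained to predict.
    Assumes the optimal-transport path (an assumption, not from the paper).
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise sample
    t = rng.uniform()                    # time in [0, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0    # target velocity along the path
    return t, x_t, u_t

def masked_cfm_loss(v_pred, u_t, mask):
    """Mean squared error over masked frames only.

    mask: boolean array of shape (frames,), True where the frame was
    hidden from the model (hypothetical convention for this sketch).
    """
    per_frame = ((v_pred - u_t) ** 2).mean(axis=1)
    return float((per_frame * mask).sum() / max(mask.sum(), 1))
```

In a full system, `v_pred` would come from a network that also receives the discrete pitch tokens and the target-speaker prompt as conditioning; that network is omitted here.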

Problem

Research questions and friction points this paper is trying to address.

Enhancing expressiveness in voice conversion with pitch-conditioned flow matching
Improving timbre similarity by combining global timbre embeddings with time-varying timbre tokens
Advancing style transfer in voice conversion models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete pitch tokens enhance expressiveness
Masked pitch-conditioned flow matching model
Global and time-varying timbre embeddings improve similarity
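
To illustrate what "discrete pitch tokens" means in practice, here is a minimal sketch of the nearest-neighbor codebook lookup at the core of any VQ-VAE. It assumes an encoder has already mapped pitch features to per-frame latent vectors; the `quantize` function, the codebook values, and the dimensions are hypothetical and not taken from the paper. The encoder, decoder, and training losses are not shown.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each frame's latent vector to its nearest codebook entry.

    latents:  (frames, dim) continuous encoder outputs (assumed given).
    codebook: (K, dim) learned code vectors.
    Returns (tokens, quantized): integer token ids of shape (frames,)
    and the corresponding code vectors of shape (frames, dim).
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)
    return tokens, codebook[tokens]

# Toy example with a 3-entry codebook in 2 dimensions.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [3.0, 3.5]])
tokens, quantized = quantize(latents, codebook)
print(tokens)  # [0 1 2]
```

The resulting token sequence is the kind of discrete representation a downstream synthesis model could be conditioned on.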