MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

📅 2025-10-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Zero-shot voice conversion (VC) increasingly needs streaming inference, a small model footprint, and high-fidelity synthesis at the same time. Existing streaming methods, whether autoregressive (AR) or non-autoregressive (NAR), either need large parameter counts to perform well or generalize poorly to unseen speakers. This paper proposes MeanVC, a lightweight streaming zero-shot VC framework with three components: mean flows, which regress the average velocity field of the flow trajectory and so allow high-quality single-step sampling; a diffusion transformer with chunk-wise autoregressive denoising, which combines AR temporal modeling with NAR parallel efficiency; and diffusion adversarial post-training, which mitigates over-smoothing. According to the authors' experiments, MeanVC outperforms existing streaming zero-shot VC systems in conversion quality while running more efficiently and using significantly fewer parameters.

📝 Abstract

Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
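The single-step claim in the abstract rests on regressing an average velocity rather than an instantaneous one. The toy sketch below is not the MeanVC model. It assumes a one-dimensional ODE dz/dt = A * z, chosen only because its average velocity has a closed form, and it assumes the common convention that t = 1 is the start (noise) and t = 0 is the endpoint. All names (`A`, `average_velocity`, `one_step_mean_flow`) are hypothetical. It shows that one jump with the average velocity lands exactly on the endpoint, while one Euler step with the instantaneous velocity does not.

```python
import math

A = 1.5  # hypothetical rate of the toy ODE dz/dt = A * z (not from the paper)


def instantaneous_velocity(z, t):
    """Velocity of the toy ODE at state z; it does not depend on t here."""
    return A * z


def average_velocity(z_t, r, t):
    """Average velocity over [r, t], i.e. (z_t - z_r) / (t - r).

    For the toy ODE, z_r = z_t * exp(A * (r - t)), which gives the closed
    form below. In a mean-flow model a network would approximate this.
    """
    if t == r:
        return instantaneous_velocity(z_t, t)
    return z_t * (1.0 - math.exp(A * (r - t))) / (t - r)


def one_step_mean_flow(z_1):
    """Jump from t = 1 to t = 0 in one step using the average velocity."""
    return z_1 - (1.0 - 0.0) * average_velocity(z_1, 0.0, 1.0)


def one_step_euler(z_1):
    """Same jump with a single Euler step on the instantaneous velocity."""
    return z_1 - (1.0 - 0.0) * instantaneous_velocity(z_1, 1.0)


z_0_true = 0.8                    # endpoint we want to recover
z_1 = z_0_true * math.exp(A)      # start of the trajectory at t = 1

mean_flow_error = abs(one_step_mean_flow(z_1) - z_0_true)
euler_error = abs(one_step_euler(z_1) - z_0_true)
print(mean_flow_error, euler_error)
```

In this toy setting the average-velocity jump is exact up to floating-point error because the average is known in closed form. In practice the quality of a single step depends on how well the network has learned the average velocity field.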

Problem

Research questions and friction points this paper is trying to address.

Achieving efficient zero-shot voice conversion for unseen speakers

Enabling lightweight streaming inference with high-fidelity performance

Overcoming limitations of existing autoregressive and non-autoregressive frameworks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mean flows enable single-step voice conversion

Chunk-wise autoregressive denoising for streaming processing

Diffusion adversarial post-training enhances speech quality
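As a rough illustration of how chunk-wise autoregressive denoising and single-step sampling could fit together in a streaming loop, here is a minimal sketch. It is not the authors' implementation. It assumes one scalar feature per frame, a bounded window of previously generated chunks as context, and a stand-in function `toy_average_velocity` in place of the trained diffusion transformer. Every name and parameter (`stream_convert`, `chunk_size`, `max_context_chunks`) is hypothetical.

```python
import random


def toy_average_velocity(noise, content, context):
    """Hypothetical stand-in for the trained network.

    For a straight path from data to noise, the average velocity is
    noise - data. Here the "data" is simply the content feature, and the
    context argument is accepted but ignored.
    """
    return [n - c for n, c in zip(noise, content)]


def stream_convert(content_frames, chunk_size=4, max_context_chunks=2, seed=0):
    """Convert frames chunk by chunk, one sampling step per chunk."""
    rng = random.Random(seed)
    generated_chunks = []
    context_sizes = []  # recorded only to make the causal window visible
    for start in range(0, len(content_frames), chunk_size):
        chunk = content_frames[start:start + chunk_size]
        # Autoregressive across chunks: only already generated chunks are visible.
        recent = generated_chunks[-max_context_chunks:] if max_context_chunks > 0 else []
        context = [x for past in recent for x in past]
        context_sizes.append(len(context))
        # Parallel within a chunk: all frames of the chunk are denoised together.
        noise = [rng.gauss(0.0, 1.0) for _ in chunk]
        u = toy_average_velocity(noise, chunk, context)
        # Single sampling step: jump from noise (t = 1) to the output (t = 0).
        generated_chunks.append([n - v for n, v in zip(noise, u)])
    output = [x for done in generated_chunks for x in done]
    return output, context_sizes


frames = [float(i) for i in range(10)]
output, context_sizes = stream_convert(frames)
print(len(output), context_sizes)
```

Because each chunk needs only past chunks and one network call, latency in a loop like this is bounded by the chunk length plus one forward pass. The actual chunk size, context window, and conditioning on the target speaker are not specified in this summary and are left out of the sketch.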
