MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows

📅 2025-10-09
🤖 AI Summary
Zero-shot voice conversion (VC) increasingly needs to support streaming inference, a small model footprint, and high-fidelity synthesis at the same time. Existing streaming approaches, whether autoregressive (AR) or non-autoregressive (NAR), struggle to balance generalization to unseen speakers, computational efficiency, and audio quality. This paper proposes MeanVC, a lightweight streaming zero-shot VC framework. It regresses the average velocity field of the flow trajectory via mean flows, which enables high-quality single-step sampling; it combines a diffusion transformer with chunk-wise autoregressive denoising to draw on both AR's temporal modeling and NAR's parallel efficiency; and it applies diffusion adversarial post-training to reduce over-smoothing. According to the reported experiments, MeanVC outperforms existing streaming zero-shot VC systems in speech naturalness and speaker similarity while using significantly fewer parameters.

📝 Abstract
Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
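The single-step claim in the abstract can be illustrated with a toy example. The sketch below is not the MeanVC model: it assumes a hand-written linear velocity field (`instantaneous_velocity`, with a hypothetical constant `A`) whose average velocity over an interval is known in closed form. It shows why stepping once with the average velocity lands exactly on the trajectory endpoint, while a single Euler step with the instantaneous velocity does not. All function names are illustrative.

```python
import numpy as np

A = -1.5  # hypothetical rate constant of the toy velocity field


def instantaneous_velocity(z, t):
    """Toy velocity field v(z, t) = A * z (a stand-in, not a learned model)."""
    return A * z


def average_velocity(z_t, r, t):
    """Closed-form average velocity over [r, t] for the toy field.

    By definition u(z_t, r, t) = (z_t - z_r) / (t - r), and for this field
    the ODE solution gives z_r = z_t * exp(A * (r - t)).
    """
    z_r = z_t * np.exp(A * (r - t))
    return (z_t - z_r) / (t - r)


def one_step(z_t, r, t, velocity_fn):
    """Jump from time t to time r in one step: z_r = z_t - (t - r) * velocity."""
    return z_t - (t - r) * velocity_fn(z_t, r, t)


rng = np.random.default_rng(0)
z1 = rng.standard_normal(4)   # state at t = 1 (start of sampling)
exact_z0 = z1 * np.exp(-A)    # exact ODE solution at t = 0 (endpoint)

# One step with the average velocity versus one Euler step.
mean_flow_z0 = one_step(z1, 0.0, 1.0, average_velocity)
euler_z0 = one_step(z1, 0.0, 1.0, lambda z, r, t: instantaneous_velocity(z, t))

mean_flow_error = float(np.max(np.abs(mean_flow_z0 - exact_z0)))
euler_error = float(np.max(np.abs(euler_z0 - exact_z0)))
print(f"average-velocity step error: {mean_flow_error:.2e}")
print(f"single Euler step error:     {euler_error:.2e}")
```

In a trained system the closed-form `average_velocity` would be replaced by a network that regresses this quantity, so the one-step result is only as accurate as that regression.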
Problem

Research questions and friction points this paper is trying to address.

Achieving efficient zero-shot voice conversion for unseen speakers
Enabling lightweight streaming inference with high-fidelity performance
Overcoming limitations of existing autoregressive and non-autoregressive frameworks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mean flows enable single-step voice conversion
Chunk-wise autoregressive denoising for streaming processing
Diffusion adversarial post-training enhances speech quality
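The chunk-wise autoregressive idea listed above can be sketched as a streaming loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_one_step_denoiser` is a hypothetical stand-in for a learned one-step generator, and `CHUNK` and `CONTEXT` are made-up sizes. The point is the control flow: frames inside a chunk are produced in parallel, while each chunk is conditioned only on previously generated chunks, so earlier output never depends on future input.

```python
import numpy as np

CHUNK = 4     # frames per chunk (hypothetical value)
CONTEXT = 2   # number of previous output chunks used as context (hypothetical)


def toy_one_step_denoiser(noise, content, context):
    """Stand-in for a learned one-step generator (NOT the MeanVC network)."""
    ctx_mean = context.mean(axis=0, keepdims=True) if len(context) else 0.0
    return content + 0.1 * ctx_mean + 0.01 * noise


def convert_stream(content_frames, seed=0):
    """Process frames chunk by chunk, conditioning on past outputs only."""
    rng = np.random.default_rng(seed)
    history = []   # previously generated chunks
    outputs = []
    for start in range(0, len(content_frames), CHUNK):
        content = content_frames[start:start + CHUNK]
        noise = rng.standard_normal(content.shape)
        if history:
            context = np.concatenate(history[-CONTEXT:])
        else:
            context = np.zeros((0, content.shape[1]))
        # All frames of the current chunk are generated in parallel.
        chunk_out = toy_one_step_denoiser(noise, content, context)
        history.append(chunk_out)
        outputs.append(chunk_out)
    return np.concatenate(outputs)


frames = np.random.default_rng(1).standard_normal((10, 3))
converted = convert_stream(frames)

# Changing only the last chunk's input must leave earlier output untouched.
modified = frames.copy()
modified[8:] += 5.0
converted_modified = convert_stream(modified)
is_causal = bool(np.allclose(converted[:8], converted_modified[:8]))
print("earlier chunks unaffected by future input:", is_causal)
```

Limiting the context to the last `CONTEXT` chunks keeps per-chunk cost constant, which is one common way to bound latency in a streaming setting; whether MeanVC uses this exact scheme is not stated in the text above.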
Guobin Ma
Northwestern Polytechnical University
Jixun Yao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Ziqian Ning
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Yuepeng Jiang
Northwestern Polytechnical University
Speech Processing, Speech Synthesis, Voice Conversion
Lingxin Xiong
Geely Automobile Research Institute (Ningbo) Company Ltd, Ningbo, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Pengcheng Zhu
Fuxi AI Lab, NetEase Inc.
speech synthesis, singing voice synthesis, talking avatar, voice conversion