🤖 AI Summary
Zero-shot voice conversion (VC) increasingly needs to support streaming inference, a small model footprint, and high-fidelity synthesis at the same time. Existing autoregressive (AR) and non-autoregressive (NAR) approaches struggle to balance generalization, computational efficiency, and audio quality. This paper proposes MeanVC, a lightweight streaming zero-shot VC framework. It uses mean flows to regress the average velocity field of the flow trajectory, enabling high-quality single-step sampling; combines a diffusion transformer with chunk-wise autoregressive denoising to draw on both AR's temporal modeling and NAR's parallel efficiency; and applies diffusion adversarial post-training to reduce over-smoothing. The resulting model uses significantly fewer parameters while maintaining speech naturalness and speaker similarity under real-time inference constraints. Experimental results show that it outperforms existing streaming zero-shot VC systems in conversion quality and efficiency.
📝 Abstract
Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
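The single-step sampling idea can be illustrated with a toy sketch. It is not the MeanVC model: the learned diffusion transformer is replaced by a hypothetical one-dimensional ODE, dz/dt = A·z, whose average velocity over an interval has a closed form. The names `A`, `average_velocity`, and `one_step_sample`, and the convention that t = 1 is the start and t = 0 the endpoint of the trajectory, are assumptions made for illustration. The sketch only shows why stepping with the average velocity reaches the endpoint in one jump, while a single Euler step with the instantaneous velocity does not.

```python
import numpy as np

A = 2.0  # hypothetical rate constant of the toy ODE dz/dt = A * z


def instantaneous_velocity(z, t):
    """Velocity of the toy ODE at state z (independent of t here)."""
    return A * z


def average_velocity(z_t, r, t):
    """Closed-form average of the velocity over [r, t] along the
    trajectory that passes through z_t at time t.

    In a mean-flow model this quantity would be predicted by a trained
    network; here it is derived analytically from z_r = z_t * exp(-A * (t - r)).
    """
    return z_t * (1.0 - np.exp(-A * (t - r))) / (t - r)


def one_step_sample(z_t, r, t, velocity_fn):
    """Jump from time t to time r in a single step: z_r = z_t - (t - r) * u."""
    return z_t - (t - r) * velocity_fn(z_t, r, t)


# Start of the trajectory (t = 1) and the exact endpoint (t = 0).
z1 = np.array([1.0, -0.5, 3.0])
exact_z0 = z1 * np.exp(-A)

# One step with the average velocity lands on the endpoint.
mean_flow_z0 = one_step_sample(z1, 0.0, 1.0, average_velocity)

# One Euler step with the instantaneous velocity misses it.
euler_z0 = one_step_sample(
    z1, 0.0, 1.0, lambda z, r, t: instantaneous_velocity(z, t)
)

print("exact     :", exact_z0)
print("mean flow :", mean_flow_z0)
print("euler     :", euler_z0)
```

In this toy setting the average-velocity step reproduces the exact endpoint, whereas the single Euler step overshoots because the trajectory is curved. In a real system the average velocity is only approximated by the network, so the one-step result is as good as that approximation.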