RVCBench: Benchmarking the Robustness of Voice Cloning Across Modern Audio Generation Models

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the lack of systematic robustness evaluation for voice cloning models under realistic perturbations such as noisy reference audio, mismatched text prompts, and downstream processing. We present the first comprehensive benchmark covering the entire voice cloning pipeline, structured around four dimensions: input variations, generation challenges, output post-processing, and adversarial perturbations. The benchmark encompasses 10 tasks, 225 speakers, 14,370 audio samples, and 11 state-of-the-art models spanning autoregressive codec-token and diffusion architectures. Combining multi-dimensional perturbation injection with automated metrics, our study systematically reveals performance degradation under common distribution shifts, long-context inputs, cross-lingual scenarios, and noise or adversarial attacks. We also release an open-source, standardized evaluation platform to advance deployable voice cloning technologies.
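The summary's "multi-dimensional perturbation injection" can be illustrated with the simplest input-variation case it names: noisy reference audio. The helper below mixes white Gaussian noise into a reference waveform at a chosen signal-to-noise ratio. The function name and the SNR-based formulation are assumptions for illustration only, not the benchmark's actual implementation.

```python
import numpy as np

def add_noise_at_snr(reference: np.ndarray, snr_db: float, seed: int = 0) -> np.ndarray:
    """Mix white Gaussian noise into a reference waveform at a target SNR (dB).

    Hypothetical helper: sketches one perturbation of the kind a robustness
    benchmark might inject into reference audio before voice cloning.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(reference.shape)
    sig_power = np.mean(reference ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(sig_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(sig_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reference + scale * noise
```

Sweeping `snr_db` over a grid (e.g. 20 dB down to 0 dB) would then yield a degradation curve per model, which is the shape of analysis the benchmark's robustness tasks report.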

📝 Abstract
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audio, imperfect text prompts, and diverse downstream processing, all of which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive adversarial perturbations degrade generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.
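Benchmarks of this kind typically score cloned speech with automated metrics such as speaker similarity, computed as cosine similarity between speaker embeddings of the reference and synthesized audio. The sketch below assumes embeddings have already been extracted by some speaker encoder (the inputs are hypothetical); it is not the benchmark's actual metric code.

```python
import numpy as np

def speaker_similarity(emb_ref: np.ndarray, emb_syn: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1].

    Hypothetical metric sketch: higher values indicate the synthesized
    speech better preserves the reference speaker's identity.
    """
    a = emb_ref / np.linalg.norm(emb_ref)
    b = emb_syn / np.linalg.norm(emb_syn)
    return float(np.dot(a, b))
```

A robustness curve then follows by recomputing this score as perturbation severity increases, so the drop relative to the clean condition quantifies the "robustness gap" the abstract describes.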
Problem

Research questions and friction points this paper is trying to address.

voice cloning
robustness
audio generation
deployment shifts
benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

voice cloning
robustness benchmark
audio generation
adversarial perturbation
codec-token language models