🤖 AI Summary
To address the lack of comparable cross-task benchmarks and the narrow evaluation dimensions in neural audio, music, and speech coding, this paper introduces the first open-source unified training and evaluation platform. Methodologically, it proposes ESPnet-Codec, an integrated codec framework, and VERSA, a standalone evaluation toolkit, which together support mainstream models (e.g., SoundStream, EnCodec, DAC) with discrete quantization, residual vector quantization (RVQ), and multi-scale adversarial training. The platform enables fully automated assessment across 20 objective audio metrics and seamless integration with six ESPnet downstream tasks. The key contributions are: (1) establishing the first cross-modal neural codec benchmark; (2) significantly improving training efficiency and interoperability with downstream applications (e.g., TTS, music generation); and (3) achieving state-of-the-art performance on both objective metrics and subjective MOS scores, with full reproducibility.
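For readers unfamiliar with residual vector quantization, the sketch below illustrates the basic idea behind the RVQ stacks used by codecs such as SoundStream, EnCodec, and DAC: each codebook quantizes the residual left by the previous one, and the sum of the selected codes reconstructs the latent frame. The codebook sizes, random data, and the `rvq_encode` helper are illustrative assumptions, not code from ESPnet-Codec.

```python
# Minimal RVQ sketch (illustrative only): quantize a latent vector with a stack
# of codebooks, each one encoding the residual left by the previous stage.
import numpy as np

def rvq_encode(x, codebooks):
    """Return per-codebook indices and the accumulated reconstruction of x."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                       # each cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest code to the current residual
        indices.append(idx)
        recon += cb[idx]                       # accumulate the quantized layers
        residual -= cb[idx]                    # pass the leftover to the next codebook
    return indices, recon

# Toy usage: 4 codebooks of 256 codes each over a 128-dim latent frame.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 128)) for _ in range(4)]
x = rng.standard_normal(128)
idx, x_hat = rvq_encode(x, codebooks)
print(idx, np.linalg.norm(x - x_hat))          # residual error shrinks with more codebooks
```

Each additional codebook refines the approximation, which is why RVQ-based codecs can trade bitrate for quality simply by keeping more or fewer quantizer stages.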
📝 Abstract
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform, ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications.
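To make the evaluation side concrete, here is a minimal, hypothetical scoring pass over a single reference/decoded pair using the existing `pesq` and `pystoi` Python packages, which compute two common objective metrics of the kind VERSA aggregates. The file names are placeholders, and VERSA's actual configuration and command-line interface are not reproduced here.

```python
# Hedged sketch: score one codec-resynthesized waveform against its reference
# with wideband PESQ and STOI, using the `pesq` and `pystoi` packages.
# Paths are placeholders; this is not VERSA's own API.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("ref.wav")        # reference signal
deg, _ = sf.read("decoded.wav")     # codec output at the same sampling rate

n = min(len(ref), len(deg))         # align lengths before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ (wb):", pesq(fs, ref, deg, "wb"))     # 'wb' mode expects 16 kHz audio
print("STOI:", stoi(ref, deg, fs, extended=False))
```

A toolkit like VERSA automates this kind of loop across many metrics and whole test sets, which is what makes codec comparisons across audio, music, and speech tractable and reproducible.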