🤖 AI Summary
To address the lack of comparable cross-task benchmarks and the narrow evaluation dimensions in neural audio, music, and speech coding, this paper introduces the first open-source unified training and evaluation platform. Methodologically, it proposes ESPnet-Codec, an integrated codec framework, and VERSA, a standalone evaluation toolkit, which together support mainstream models (e.g., SoundStream, EnCodec, DAC) with discrete quantization, residual vector quantization (RVQ), and multi-scale adversarial training. The platform enables fully automated assessment across 20 objective audio metrics and seamless integration with six ESPnet downstream tasks. The key contributions are: (1) establishing the first cross-modal neural codec benchmark; (2) significantly improving training efficiency and interoperability with downstream applications (e.g., TTS, music generation); and (3) achieving state-of-the-art performance on both objective metrics and subjective MOS scores, with full reproducibility.
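For readers unfamiliar with residual vector quantization, the sketch below illustrates the basic idea behind the RVQ stacks used by codecs such as SoundStream, EnCodec, and DAC: each codebook quantizes the residual left by the previous one, and the sum of the selected codes reconstructs the latent frame. The codebook sizes, random data, and the `rvq_encode` helper are illustrative assumptions, not code from ESPnet-Codec.

```python
# Minimal RVQ sketch (illustrative only): quantize a latent vector with a stack
# of codebooks, each one encoding the residual left by the previous stage.
import numpy as np

def rvq_encode(x, codebooks):
    """Return per-codebook indices and the accumulated reconstruction of x."""
    residual = x.copy()
    indices, recon = [], np.zeros_like(x)
    for cb in codebooks:                       # each cb has shape (num_codes, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))            # nearest code to the current residual
        indices.append(idx)
        recon += cb[idx]                       # accumulate the quantized layers
        residual -= cb[idx]                    # pass the leftover to the next codebook
    return indices, recon

# Toy usage: 4 codebooks of 256 codes each over a 128-dim latent frame.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((256, 128)) for _ in range(4)]
x = rng.standard_normal(128)
idx, x_hat = rvq_encode(x, codebooks)
print(idx, np.linalg.norm(x - x_hat))          # residual error shrinks with more codebooks
```

Each additional codebook refines the approximation, which is why RVQ-based codecs can trade bitrate for quality simply by keeping more or fewer quantizer stages.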
📝 Abstract
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform, ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications.
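To make the evaluation side concrete, here is a minimal, hypothetical scoring pass over a single reference/decoded pair using the existing `pesq` and `pystoi` Python packages, which compute two common objective metrics of the kind VERSA aggregates. The file names are placeholders, and VERSA's actual configuration and command-line interface are not reproduced here.

```python
# Hedged sketch: score one codec-resynthesized waveform against its reference
# with wideband PESQ and STOI, using the `pesq` and `pystoi` packages.
# Paths are placeholders; this is not VERSA's own API.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, fs = sf.read("ref.wav")        # reference signal
deg, _ = sf.read("decoded.wav")     # codec output at the same sampling rate

n = min(len(ref), len(deg))         # align lengths before scoring
ref, deg = ref[:n], deg[:n]

print("PESQ (wb):", pesq(fs, ref, deg, "wb"))     # 'wb' mode expects 16 kHz audio
print("STOI:", stoi(ref, deg, fs, extended=False))
```

A toolkit like VERSA automates this kind of loop across many metrics and whole test sets, which is what makes codec comparisons across audio, music, and speech tractable and reproducible.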