ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs For Audio, Music, and Speech

📅 2024-09-24
🏛️ Spoken Language Technology Workshop
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the lack of comparable cross-task results and the narrow evaluation dimensions in neural audio, music, and speech coding, this paper introduces the first open-source unified training and evaluation platform. Methodologically, it proposes ESPnet-Codec, an integrated codec framework, and VERSA, a standalone evaluation toolkit, supporting mainstream models (e.g., SoundStream, EnCodec, DAC) with discrete quantization, residual vector quantization (RVQ), and multi-scale adversarial training. The platform enables fully automated assessment across 20 objective audio metrics and seamless integration with six ESPnet downstream tasks. The key contributions are: (1) establishing the first cross-modal neural codec benchmark; (2) significantly improving training efficiency and interoperability for downstream applications (e.g., TTS, music generation); and (3) achieving state-of-the-art performance on both objective metrics and subjective MOS scores, with full reproducibility.
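The summary above mentions residual vector quantization (RVQ), the coarse-to-fine quantization scheme shared by SoundStream, EnCodec, and DAC. A minimal toy sketch of the idea, not the paper's implementation: each stage quantizes the residual left by the previous stage, so later codebooks refine earlier ones. The two-entry codebooks here are hand-picked purely for illustration.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: stage k quantizes the residual
    left after subtracting the codewords chosen by stages 1..k-1."""
    residual = np.asarray(x, dtype=float)
    codes = []
    for cb in codebooks:
        # pick the codebook entry nearest to the current residual
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy 2-stage example: stage 2 refines stage 1's coarse estimate.
cb1 = np.array([[0.0, 0.0], [1.0, 1.0]])      # coarse codebook
cb2 = np.array([[0.0, 0.0], [0.25, 0.25]])    # refinement codebook
x = np.array([1.2, 1.2])
codes, final_residual = rvq_encode(x, [cb1, cb2])
x_hat = rvq_decode(codes, [cb1, cb2])
# codes -> [1, 1]; x_hat -> [1.25, 1.25], closer to x than cb1 alone
```

Because each stage only has to cover the residual of the previous one, a stack of small codebooks approximates the vector far better than a single codebook of the same total bit budget, which is why RVQ is the dominant design in neural codecs.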

📝 Abstract
Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications.
Problem

Research questions and friction points this paper is trying to address.

Neural codecs for audio, music, and speech
Challenges in fair comparisons across applications
Comprehensive evaluation of codec performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source neural codec platform
Comprehensive audio evaluation toolkit
Integration with multiple ESPnet tasks
Jiatong Shi
Carnegie Mellon University
Jinchuan Tian
Language Technologies Institute, Carnegie Mellon University
Speech and Language Processing
Yihan Wu
Carnegie Mellon University, Renmin University of China
Jee-weon Jung
Apple, Carnegie Mellon University
Speaker recognition, Anti-spoofing, Speaker diarization, Speech processing, Deep learning
J. Yip
Nanyang Technological University
Yoshiki Masuyama
Mitsubishi Electric Research Laboratories (MERL)
Audio Signal Processing, Signal Processing, Machine Learning
William Chen
Carnegie Mellon University
Spoken Language Processing, Speech Recognition, Speech Translation, Machine Translation
Yuning Wu
Wayne State University
perceptions of crime & justice, police attitudes and behaviors, victimization, criminological theories, law and society
Yuxun Tang
Renmin University of China
Massa Baali
Carnegie Mellon University
Speech and Audio Processing, Deep Learning
Dareen Alharthi
Carnegie Mellon University
Dong Zhang
University of Chicago
Ruifan Deng
University of Chicago
Tejes Srivastava
National Taiwan University
Haibin Wu
Meta
speech processing, multi-modal, speech synthesis, LLM
Alexander H. Liu
Massachusetts Institute of Technology
Bhiksha Raj
Carnegie Mellon University
Deep Learning, Artificial Intelligence, Speech and Audio Processing, Signal Processing, Machine Learning
Qin Jin
School of Information, Renmin University of China
Artificial Intelligence
Ruihua Song
Renmin University of China
AI-based creation, multi-modality chitchat, natural language understanding, information retrieval, information extraction
Shinji Watanabe
Carnegie Mellon University
Speech recognition, Speech processing, Speech enhancement, Speech translation