🤖 AI Summary
Existing neural audio codecs are either limited in domain, being optimized solely for speech, or poorly suited to downstream tasks because of their multi-codebook designs. This paper proposes MelCap, a unified, single-codebook neural audio codec for high-fidelity, low-bitrate compression and reconstruction across speech, music, and general audio. Its core innovations are: (1) a two-stage architecture that uses a 2D tokenizer to discretize mel-spectrograms; (2) a perceptual loss that mitigates over-smoothing artifacts in the reconstructed spectrograms; and (3) a lightweight vocoder that synthesizes waveforms in a single forward pass. Experiments show that MelCap is comparable to state-of-the-art multi-codebook methods in both objective metrics (e.g., STOI, PESQ) and subjective MOS scores, while improving decoding efficiency and compatibility with downstream tasks such as speech recognition and audio classification.
📝 Abstract
Neural audio codecs have recently emerged as powerful tools for high-quality, low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches rely either on a single quantizer that handles only the speech domain, or on multiple quantizers that are not well suited to downstream tasks. To address this issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic detail than previous single-codebook approaches, while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact tokens from a single codebook using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a vocoder recovers waveforms from the discrete mel tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs, while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.
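
To make the first stage more concrete, the following is a minimal sketch of what "quantizing a mel-spectrogram into a 2D grid of tokens with one codebook" can look like. It is not MelCap's tokenizer: the paper's tokenizer is a learned model, whereas this toy version cuts the spectrogram into fixed patches and does a nearest-neighbour lookup in a random codebook. The patch size, codebook size, spectrogram shape, and all function names are assumptions chosen for illustration, and the second-stage vocoder is omitted entirely.

```python
import numpy as np

def patchify(mel, ph, pw):
    """Split an (n_mels, n_frames) array into flattened ph x pw patches."""
    n_mels, n_frames = mel.shape
    assert n_mels % ph == 0 and n_frames % pw == 0, "shape must divide evenly"
    gh, gw = n_mels // ph, n_frames // pw
    patches = mel.reshape(gh, ph, gw, pw).transpose(0, 2, 1, 3)
    return patches.reshape(gh * gw, ph * pw), (gh, gw)

def unpatchify(patches, grid, ph, pw):
    """Inverse of patchify: rebuild the (n_mels, n_frames) array."""
    gh, gw = grid
    blocks = patches.reshape(gh, gw, ph, pw).transpose(0, 2, 1, 3)
    return blocks.reshape(gh * ph, gw * pw)

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry."""
    # Squared Euclidean distance expanded as |v|^2 - 2 v.c + |c|^2.
    dists = (
        (vectors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * vectors @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return dists.argmin(axis=1)

# Hypothetical sizes, not taken from the paper.
N_MELS, N_FRAMES = 80, 128
PH, PW = 8, 8
CODEBOOK_SIZE = 1024

rng = np.random.default_rng(0)
mel = rng.standard_normal((N_MELS, N_FRAMES))           # stand-in for a real mel-spectrogram
codebook = rng.standard_normal((CODEBOOK_SIZE, PH * PW))  # stand-in for a learned codebook

# Encode: one integer token per 2D patch, all drawn from the same codebook.
patches, grid = patchify(mel, PH, PW)
tokens = quantize(patches, codebook).reshape(grid)

# Decode: look the tokens up again and reassemble the spectrogram.
mel_hat = unpatchify(codebook[tokens.reshape(-1)], grid, PH, PW)

# With a single codebook, each token costs log2(CODEBOOK_SIZE) bits.
bits_per_token = np.log2(CODEBOOK_SIZE)
print(tokens.shape, mel_hat.shape, bits_per_token)
```

Because every position in the token grid indexes the same codebook, a downstream model sees one flat vocabulary rather than several parallel token streams, which is the property the abstract highlights for single-codebook designs.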