MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression

📅 2025-10-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing neural audio codecs are either limited in domain, being optimized solely for speech, or poorly suited to downstream tasks because of their multi-codebook designs. This paper proposes MelCap, the first unified, single-codebook neural audio codec capable of high-fidelity, low-bitrate compression and reconstruction across speech, music, and general audio. Its core innovations are: (1) a two-stage architecture that uses a 2D tokenizer to discretize mel-spectrograms; (2) a perceptual loss that suppresses over-smoothing artifacts in the reconstructed spectrogram; and (3) a lightweight vocoder that synthesizes the waveform in a single forward pass. Experiments demonstrate that MelCap matches state-of-the-art multi-codebook methods in both objective metrics (e.g., STOI, PESQ) and subjective MOS scores, while significantly improving decoding efficiency and compatibility with downstream tasks such as speech recognition and audio classification.

📝 Abstract

Neural audio codecs have recently emerged as powerful tools for high-quality, low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that handles only the speech domain, or on multiple quantizers that are not well suited to downstream tasks. To address this issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic detail than previous single-codebook approaches, while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact single-codebook tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a vocoder recovers waveforms from the discrete mel tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs, while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.
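
The first stage described above, turning a mel-spectrogram into tokens drawn from a single codebook, can be illustrated with a toy sketch. This is not MelCap's tokenizer: the abstract does not specify its architecture, and a real 2D tokenizer would use a learned encoder and a trained codebook. Here the patch size, codebook size, random codebook, and the helper names `split_into_patches` and `quantize` are all assumptions made purely for illustration; the only idea carried over is that each 2D region of the spectrogram maps to one index in one codebook.

```python
import random

def split_into_patches(mel, patch_f, patch_t):
    """Cut an (n_mels x n_frames) grid into flattened patch_f x patch_t patches."""
    n_mels, n_frames = len(mel), len(mel[0])
    patches = []
    for f0 in range(0, n_mels - patch_f + 1, patch_f):
        for t0 in range(0, n_frames - patch_t + 1, patch_t):
            patches.append([mel[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches

def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook vector (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(p, codebook[k]))
            for p in patches]

def dequantize(tokens, codebook):
    """Look each token up in the codebook to get the quantized patches back."""
    return [codebook[k] for k in tokens]

# Toy data (hypothetical sizes): an 8-bin x 12-frame "mel-spectrogram",
# 4x4 patches, and a random 16-entry codebook standing in for a trained one.
rng = random.Random(0)
mel = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(8)]
codebook = [[rng.gauss(0.0, 1.0) for _ in range(4 * 4)] for _ in range(16)]

patches = split_into_patches(mel, patch_f=4, patch_t=4)
tokens = quantize(patches, codebook)      # one integer per 2D patch
recon = dequantize(tokens, codebook)      # quantized patches a decoder would consume
```

In this sketch the whole spectrogram becomes a flat list of integers from one codebook, which is the property that makes single-codebook tokens convenient inputs for downstream models; in the paper, the second-stage vocoder would consume the token-derived representation rather than the raw patches used here.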

Problem

Research questions and friction points this paper is trying to address.

Unified neural codec handles speech, music, and general sound

Single codebook preserves acoustic details better than its predecessors

Achieves multi-codebook performance with a simpler computational design

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified single-codebook neural codec for all audio types

Two-stage decomposition with mel-spectrogram compression and quantization

Vocoder enables real-time waveform reconstruction from tokens
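
As a rough aid for reasoning about the low-bitrate claim, the nominal bitrate of any token-based codec follows from the token rate, the number of codebooks, and the codebook size, since each index costs log2(codebook size) bits under fixed-length coding. The sketch below shows only that arithmetic. The token rates and codebook sizes are hypothetical placeholders, not figures reported for MelCap or any other codec, and `token_bitrate_bps` is a helper name invented here.

```python
import math

def token_bitrate_bps(tokens_per_second, codebook_size, num_codebooks=1):
    """Nominal bitrate in bits/s when every token is sent as a fixed-length index."""
    return tokens_per_second * num_codebooks * math.log2(codebook_size)

# Hypothetical single-codebook setup: 50 tokens/s from a 4096-entry codebook.
single = token_bitrate_bps(tokens_per_second=50, codebook_size=4096)

# Hypothetical multi-codebook setup: 75 frames/s, 8 codebooks of 1024 entries each.
multi = token_bitrate_bps(tokens_per_second=75, codebook_size=1024, num_codebooks=8)

print(single, multi)  # 600.0 6000.0
```

The same formula makes the design trade-off visible: a single codebook produces one token stream that is simpler to model downstream, while the number of tokens per second and the codebook size then carry the full burden of preserving acoustic detail.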

🔎 Similar Papers

FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates