MelCap: A Unified Single-Codebook Neural Codec for High-Fidelity Audio Compression

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing neural audio codecs suffer from domain limitations—being optimized solely for speech—or exhibit poor downstream task compatibility due to multi-codebook designs. This paper proposes MelCap, the first unified, single-codebook neural audio codec capable of high-fidelity, low-bitrate compression and reconstruction across speech, music, and general audio. Its core innovations are: (1) a two-stage architecture employing a 2D tokenizer to discretize Mel-spectrograms; (2) a perceptual loss that effectively suppresses spectral smoothing artifacts; and (3) a lightweight vocoder enabling waveform synthesis in a single forward pass. Experiments demonstrate that MelCap matches state-of-the-art multi-codebook methods in both objective metrics (e.g., STOI, PESQ) and subjective MOS scores, while significantly improving decoding efficiency and compatibility with downstream tasks such as speech recognition and audio classification.

📝 Abstract
Neural audio codecs have recently emerged as powerful tools for high-quality, low-bitrate audio compression, leveraging deep generative models to learn latent representations of audio signals. However, existing approaches either rely on a single quantizer that handles only the speech domain, or on multiple quantizers that are not well suited to downstream tasks. To address this issue, we propose MelCap, a unified "one-codebook-for-all" neural codec that effectively handles speech, music, and general sound. By decomposing audio reconstruction into two stages, our method preserves more acoustic detail than previous single-codebook approaches while achieving performance comparable to mainstream multi-codebook methods. In the first stage, audio is transformed into mel-spectrograms, which are compressed and quantized into compact single tokens using a 2D tokenizer. A perceptual loss is further applied to mitigate the over-smoothing artifacts observed in spectrogram reconstruction. In the second stage, a vocoder recovers waveforms from the discrete mel tokens in a single forward pass, enabling real-time decoding. Both objective and subjective evaluations demonstrate that MelCap achieves quality comparable to state-of-the-art multi-codebook codecs while retaining the computational simplicity of a single-codebook design, thereby providing an effective representation for downstream tasks.
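The first stage described in the abstract, turning a mel-spectrogram into a single stream of discrete tokens, can be illustrated with a toy nearest-neighbour lookup. This is only a sketch under stated assumptions: the tokenizer in the paper is a learned neural model, whereas here the "spectrogram" and the codebook are random numbers, and the sizes (`N_MELS`, `N_FRAMES`, `PH`, `PW`, `K`) and function names are hypothetical values chosen for illustration.

```python
import random

def patchify(spec, ph, pw):
    """Split a 2D grid (list of rows) into flattened ph x pw patches, row-major.
    Assumes both dimensions divide evenly by the patch size."""
    rows, cols = len(spec), len(spec[0])
    patches = []
    for r in range(0, rows, ph):
        for c in range(0, cols, pw):
            patches.append([spec[r + i][c + j] for i in range(ph) for j in range(pw)])
    return patches

def quantize(patches, codebook):
    """Map each patch to the index of its nearest codebook vector (squared L2)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist(p, codebook[k]))
            for p in patches]

def unpatchify(tokens, codebook, rows, cols, ph, pw):
    """Rebuild a 2D grid by pasting the codebook vector for each token."""
    spec = [[0.0] * cols for _ in range(rows)]
    it = iter(tokens)
    for r in range(0, rows, ph):
        for c in range(0, cols, pw):
            vec = codebook[next(it)]
            for i in range(ph):
                for j in range(pw):
                    spec[r + i][c + j] = vec[i * pw + j]
    return spec

random.seed(0)
# Hypothetical toy sizes, not values from the paper.
N_MELS, N_FRAMES, PH, PW, K = 8, 12, 4, 4, 16
spec = [[random.random() for _ in range(N_FRAMES)] for _ in range(N_MELS)]
codebook = [[random.random() for _ in range(PH * PW)] for _ in range(K)]

tokens = quantize(patchify(spec, PH, PW), codebook)   # one integer per 2D patch
recon = unpatchify(tokens, codebook, N_MELS, N_FRAMES, PH, PW)
print(len(tokens), "tokens for a", N_MELS, "x", N_FRAMES, "grid")
```

Each patch maps to exactly one integer drawn from one codebook, so a downstream model sees a single token sequence rather than several parallel streams. In the codec itself, a vocoder would then map the reconstructed spectrogram back to a waveform; that step is not sketched here.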
Problem

Research questions and friction points this paper is trying to address.

A unified neural codec that handles speech, music, and general sound
A single codebook that preserves acoustic details better than its predecessors
Multi-codebook-level performance with a simpler computational design
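To make the single-codebook versus multi-codebook trade-off concrete, the bitrate of a token stream follows from the token rate, the number of codebooks, and the codebook size (each token carries log2 of the codebook size in bits). The sketch below uses hypothetical configurations; none of the numbers are taken from the paper.

```python
import math

def bitrate_bps(tokens_per_second, codebook_size, num_codebooks=1):
    """Bits per second = token rate x codebooks per step x bits per token."""
    return tokens_per_second * num_codebooks * math.log2(codebook_size)

# Hypothetical configurations for illustration only.
single = bitrate_bps(tokens_per_second=50, codebook_size=4096)
multi = bitrate_bps(tokens_per_second=50, codebook_size=1024, num_codebooks=8)

print(single, multi)
```

With these assumed settings, the single-codebook stream emits one token per step, while the multi-codebook stream emits eight tokens per step that a downstream model would have to handle jointly.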
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified single-codebook neural codec for all audio types
Two-stage decomposition with mel-spectrogram compression and quantization
Vocoder enables real-time waveform reconstruction from tokens
Jingyi Li
International Digital Economy Academy (IDEA)
Zhiyuan Zhao
International Digital Economy Academy (IDEA)
Yunfei Liu
International Digital Economy Academy (IDEA)
Lijian Lin
Tencent ARC Lab
Computer Vision, Visual Tracking, Video Object Detection
Ye Zhu
International Digital Economy Academy (IDEA)
Jiahao Wu
The Chinese University of Hong Kong
Medical Robots, Robot-assisted Microsurgery, Motion Planning
Qiuqiang Kong
The Chinese University of Hong Kong
Audio Processing, Artificial Intelligence
Yu Li
International Digital Economy Academy (IDEA)