🤖 AI Summary
To address the loss of fine-grained acoustic detail in high-fidelity neural audio codecs caused by the entanglement of semantic and acoustic information, this paper proposes MBCodec, a functionally decoupled multi-codebook codec built on residual vector quantization (RVQ). Methodologically, it constructs a latent space jointly driven by self-supervised semantic tokenization and subband feature extraction, incorporating adaptive codebook dropout and multi-channel pseudo-quadrature mirror filtering (PQMF) to enable hierarchical, fine-grained separation and joint modeling of the two kinds of information. By integrating semantic tokenization, hierarchical codebook training, and subband constraints derived from the raw waveform, the approach substantially strengthens the codec's discrete representations. Experiments show that the model achieves roughly 170× compression (2.2 kbps) on 24 kHz audio with near-lossless reconstruction quality, consistently outperforming state-of-the-art baselines on subjective (MOS) and objective (STOI, ESTOI) metrics.
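The headline ratio is easy to sanity-check. A back-of-envelope sketch follows; the 16-bit mono PCM reference is our assumption, not stated in the summary:

```python
# Rough check of the claimed ~170x compression ratio.
# Assumption: the uncompressed reference is 16-bit mono PCM at 24 kHz.
pcm_kbps = 24_000 * 16 / 1000            # 384 kbps uncompressed
codec_kbps = 2.2                         # reported MBCodec bitrate
print(f"{pcm_kbps / codec_kbps:.0f}x")   # ~175x, consistent with the ~170x claim
```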
📝 Abstract
High-fidelity neural audio codecs for text-to-speech (TTS) aim to compress speech signals into discrete representations that can be faithfully reconstructed. However, prior approaches have struggled to disentangle acoustic and semantic information within tokens, leading to a lack of fine-grained detail in synthesized speech. In this study, we propose MBCodec, a novel multi-codebook audio codec based on Residual Vector Quantization (RVQ) that learns a hierarchically structured representation. MBCodec leverages self-supervised semantic tokenization and audio subband features from the raw signals to construct a functionally disentangled latent space. To encourage comprehensive learning across the layers of the codec embedding space, we introduce adaptive dropout depths to train codebooks differentially across layers, and employ a multi-channel pseudo-quadrature mirror filter (PQMF) during training. By thoroughly decoupling semantic and acoustic features, our method not only achieves near-lossless speech reconstruction but also compresses 24 kHz audio by a remarkable 170x, yielding a bit rate of just 2.2 kbps. Experiments confirm that it consistently and substantially outperforms baselines across all evaluations.
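For readers unfamiliar with RVQ, the sketch below illustrates the two mechanics the abstract names: each codebook quantizes the residual left by the previous one, and during training the stack is truncated to a random depth so that early codebooks carry coarse (semantic) structure while later ones refine acoustic detail. This is a minimal illustration under assumed shapes; `rvq_encode`, the toy dimensions, and the uniform depth-sampling rule are ours, not MBCodec's actual implementation.

```python
import torch

def rvq_encode(x, codebooks, n_active):
    """Residual vector quantization: codebook k quantizes the residual
    left by codebooks 0..k-1; `n_active` truncates the stack (dropout)."""
    residual = x
    quantized = torch.zeros_like(x)
    indices = []
    for cb in codebooks[:n_active]:
        # nearest codeword for each residual vector
        dists = torch.cdist(residual, cb)    # (batch, codebook_size)
        idx = dists.argmin(dim=-1)
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        indices.append(idx)
    return quantized, indices

# Depth dropout during training: sample how many RVQ layers are active.
# The uniform sampling here is an assumption; the paper's adaptive
# schedule is not specified in this summary.
torch.manual_seed(0)
dim, n_codebooks, codebook_size = 8, 4, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_codebooks)]
x = torch.randn(3, dim)
n_active = int(torch.randint(1, n_codebooks + 1, (1,)))
quantized, indices = rvq_encode(x, codebooks, n_active)
print(n_active, quantized.shape, [i.tolist() for i in indices])
```

The design intuition is that truncating the stack at random forces the first codebooks to be useful on their own, which is what lets shallow tokens serve as a compact semantic tier while deeper tokens add acoustic fidelity.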