MBCodec: Thorough disentangle for high-fidelity audio compression

📅 2025-09-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the loss of fine-grained acoustic detail in high-fidelity neural audio codecs caused by the entanglement of semantic and acoustic information, this paper proposes a functionally decoupled multi-codebook codec based on residual vector quantization (RVQ). Methodologically, it constructs a latent space jointly driven by self-supervised semantic tokenization and subband feature extraction, incorporating adaptive codebook dropout and a multi-channel pseudo-quadrature mirror filter (PQMF) to enable hierarchical, fine-grained separation and joint modeling of information. By integrating semantic tokenization, hierarchical codebook training, and subband constraints derived from the raw waveform, the approach substantially enhances discrete representation capability. Experiments demonstrate that the model achieves roughly 170× compression (2.2 kbps) on 24 kHz audio while attaining near-lossless reconstruction quality, consistently outperforming state-of-the-art baselines across metrics including MOS, STOI, and ESTOI.
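The 170× figure can be sanity-checked with back-of-the-envelope arithmetic. This sketch assumes 16-bit mono PCM as the uncompressed reference, which the summary does not state explicitly:

```python
# Uncompressed bitrate of 24 kHz audio, assuming 16-bit mono PCM
# (the paper states only "24 kHz audio"; bit depth is an assumption here).
sample_rate_hz = 24_000
bits_per_sample = 16
raw_kbps = sample_rate_hz * bits_per_sample / 1000  # 384.0 kbps

# MBCodec's reported bitrate and the resulting compression ratio.
codec_kbps = 2.2
ratio = raw_kbps / codec_kbps  # ≈ 174.5, consistent with the reported ~170×
```

Under this assumption the ratio works out to about 174×, which matches the paper's rounded "170×" claim.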

📝 Abstract
High-fidelity neural audio codecs in Text-to-speech (TTS) aim to compress speech signals into discrete representations for faithful reconstruction. However, prior approaches faced challenges in effectively disentangling acoustic and semantic information within tokens, leading to a lack of fine-grained details in synthesized speech. In this study, we propose MBCodec, a novel multi-codebook audio codec based on Residual Vector Quantization (RVQ) that learns a hierarchically structured representation. MBCodec leverages self-supervised semantic tokenization and audio subband features from the raw signals to construct a functionally-disentangled latent space. In order to encourage comprehensive learning across various layers of the codec embedding space, we introduce adaptive dropout depths to differentially train codebooks across layers, and employ a multi-channel pseudo-quadrature mirror filter (PQMF) during training. By thoroughly decoupling semantic and acoustic features, our method not only achieves near-lossless speech reconstruction but also enables a remarkable 170x compression of 24 kHz audio, resulting in a low bit rate of just 2.2 kbps. Experimental evaluations confirm its consistent and substantial outperformance of baselines across all evaluations.
Problem

Research questions and friction points this paper is trying to address.

Disentangling acoustic and semantic information in audio compression
Achieving high-fidelity speech reconstruction with fine details
Enabling extreme audio compression with minimal quality loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical residual vector quantization codec
Self-supervised semantic and subband feature tokenization
Adaptive dropout and multi-channel PQMF training
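The ideas above can be sketched in miniature. The following is a minimal, pure-Python illustration of residual vector quantization with adaptive codebook dropout; the codebook count, codebook size, and latent dimension are toy values chosen for illustration, not the paper's configuration:

```python
import random

random.seed(0)

def nearest(vec, codebook):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])))

def rvq_encode(vec, codebooks, n_active):
    """Residual VQ: each successive codebook quantizes the residual of the last."""
    residual = list(vec)
    codes = []
    for cb in codebooks[:n_active]:
        i = nearest(residual, cb)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, cb[i])]
    recon = [v - r for v, r in zip(vec, residual)]  # quantized reconstruction
    return codes, recon

# Toy setup (hypothetical sizes): 8 codebooks of 16 entries over 4-dim latents.
codebooks = [[[random.gauss(0, 1) for _ in range(4)] for _ in range(16)]
             for _ in range(8)]
latent = [random.gauss(0, 1) for _ in range(4)]

# Adaptive codebook dropout, as the summary describes it: during training only
# a random prefix of the codebook stack stays active, so the early codebooks
# must carry the coarse (semantic) information on their own.
n_active = random.randint(1, 8)
codes, recon = rvq_encode(latent, codebooks, n_active)
```

Dropping the deeper codebooks at random during training is what pushes the hierarchy to form: shallow layers cannot rely on later layers to clean up their residual, so they learn the coarse structure first.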
👥 Authors
Ruonan Zhang — Tsinghua University
Xiaoyang Hao — Tencent (speech synthesis)
Yichen Han — AMAP Speech
Junjie Cao — School of Mathematical Sciences, Dalian University of Technology (Computer Graphics, Computer Vision, Machine Learning)
Yue Liu — AMAP Speech
Kai Zhang — Tsinghua University