🤖 AI Summary
To address the loss of fine-grained acoustic detail in high-fidelity neural audio codecs caused by the entanglement of semantic and acoustic information, this paper proposes MBCodec, a functionally decoupled multi-codebook codec built on residual vector quantization (RVQ). Methodologically, it constructs a latent space jointly driven by self-supervised semantic tokenization and subband feature extraction, incorporating adaptive codebook dropout and multi-channel pseudo-quadrature mirror filtering (PQMF) to enable hierarchical, fine-grained separation and joint modeling of the two kinds of information. By integrating semantic tokenization, hierarchical codebook training, and subband constraints derived from the raw waveform, the approach substantially strengthens the codec's discrete representations. Experiments show that the model achieves roughly 170× compression (2.2 kbps) on 24 kHz audio with near-lossless reconstruction quality, consistently outperforming state-of-the-art baselines on subjective (MOS) and objective (STOI, ESTOI) metrics.
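The headline ratio is easy to sanity-check. A back-of-envelope sketch follows; the 16-bit mono PCM reference is our assumption, not stated in the summary:

```python
# Rough check of the claimed ~170x compression ratio.
# Assumption: the uncompressed reference is 16-bit mono PCM at 24 kHz.
pcm_kbps = 24_000 * 16 / 1000            # 384 kbps uncompressed
codec_kbps = 2.2                         # reported MBCodec bitrate
print(f"{pcm_kbps / codec_kbps:.0f}x")   # ~175x, consistent with the ~170x claim
```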
📝 Abstract
High-fidelity neural audio codecs for text-to-speech (TTS) aim to compress speech signals into discrete representations that can be faithfully reconstructed. However, prior approaches have struggled to disentangle acoustic and semantic information within tokens, leading to a lack of fine-grained detail in synthesized speech. In this study, we propose MBCodec, a novel multi-codebook audio codec based on Residual Vector Quantization (RVQ) that learns a hierarchically structured representation. MBCodec leverages self-supervised semantic tokenization and audio subband features from the raw signals to construct a functionally disentangled latent space. To encourage comprehensive learning across the layers of the codec embedding space, we introduce adaptive dropout depths to train codebooks differentially across layers, and employ a multi-channel pseudo-quadrature mirror filter (PQMF) during training. By thoroughly decoupling semantic and acoustic features, our method not only achieves near-lossless speech reconstruction but also compresses 24 kHz audio by a remarkable 170x, yielding a bit rate of just 2.2 kbps. Experiments confirm that it consistently and substantially outperforms baselines across all evaluations.
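For readers unfamiliar with RVQ, the sketch below illustrates the two mechanics the abstract names: each codebook quantizes the residual left by the previous one, and during training the stack is truncated to a random depth so that early codebooks carry coarse (semantic) structure while later ones refine acoustic detail. This is a minimal illustration under assumed shapes; `rvq_encode`, the toy dimensions, and the uniform depth-sampling rule are ours, not MBCodec's actual implementation.

```python
import torch

def rvq_encode(x, codebooks, n_active):
    """Residual vector quantization: codebook k quantizes the residual
    left by codebooks 0..k-1; `n_active` truncates the stack (dropout)."""
    residual = x
    quantized = torch.zeros_like(x)
    indices = []
    for cb in codebooks[:n_active]:
        # nearest codeword for each residual vector
        dists = torch.cdist(residual, cb)    # (batch, codebook_size)
        idx = dists.argmin(dim=-1)
        q = cb[idx]
        quantized = quantized + q
        residual = residual - q
        indices.append(idx)
    return quantized, indices

# Depth dropout during training: sample how many RVQ layers are active.
# The uniform sampling here is an assumption; the paper's adaptive
# schedule is not specified in this summary.
torch.manual_seed(0)
dim, n_codebooks, codebook_size = 8, 4, 16
codebooks = [torch.randn(codebook_size, dim) for _ in range(n_codebooks)]
x = torch.randn(3, dim)
n_active = int(torch.randint(1, n_codebooks + 1, (1,)))
quantized, indices = rvq_encode(x, codebooks, n_active)
print(n_active, quantized.shape, [i.tolist() for i in indices])
```

The design intuition is that truncating the stack at random forces the first codebooks to be useful on their own, which is what lets shallow tokens serve as a compact semantic tier while deeper tokens add acoustic fidelity.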