🤖 AI Summary
RVQ-based neural audio codecs suffer from high quantization error and limited synthesis quality at inference because their codebooks are fixed. Method: This paper proposes a test-time dynamic code selection algorithm that requires no retraining. At inference, it jointly optimizes the discrete codebook indices across all RVQ levels via hierarchical codebook search and layer-wise quantization error minimization, overcoming the suboptimality of conventional greedy encoding. Contribution/Results: The core innovation is formulating codebook selection as a differentiable path optimization problem that preserves the discrete constraints while allowing the reconstruction error to be backpropagated end to end. Experiments show significant improvements in both objective metrics (e.g., LSD, MRSTFT) and subjective MOS scores, with an average 23.6% reduction in quantization error. The method is fully compatible with existing RVQ models and adds negligible deployment overhead.
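The joint, multi-level code selection sketched above can be illustrated with a toy beam search over RVQ levels. Everything here is a hypothetical stand-in, not the paper's actual algorithm: the codebooks are random, and the shapes and beam width are made up. With a beam width of 1 the search reduces exactly to conventional greedy encoding; a wider beam explores alternative code paths across levels and typically, though not provably, ends with a lower quantization error.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, L = 4, 8, 3                       # latent dim, codebook size, RVQ levels (toy values)
codebooks = rng.normal(size=(L, K, D))  # one random codebook per level (hypothetical)
x = rng.normal(size=D)                  # target latent vector

def greedy_encode(x, codebooks):
    """Conventional RVQ encoding: quantize the residual level by level."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, float(np.sum(residual ** 2))

def beam_encode(x, codebooks, beam=4):
    """Joint selection: keep the `beam` best partial code paths at every level."""
    paths = [([], x.copy())]            # (codes so far, remaining residual)
    for cb in codebooks:
        cand = []
        for codes, res in paths:
            errs = np.sum((cb - res) ** 2, axis=1)
            for idx in np.argsort(errs)[:beam]:
                cand.append((codes + [int(idx)], res - cb[idx]))
        cand.sort(key=lambda p: float(np.sum(p[1] ** 2)))
        paths = cand[:beam]             # prune to the best `beam` paths
    codes, res = paths[0]
    return codes, float(np.sum(res ** 2))

g_codes, g_err = greedy_encode(x, codebooks)
b_codes, b_err = beam_encode(x, codebooks, beam=4)
print(g_err, b_err)
```

Beam search is only one tractable way to search the hierarchy; the paper's gradient-based path optimization is a different mechanism, but both target the same failure mode of greedy per-level argmin selection.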
📝 Abstract
The residual vector quantization (RVQ) technique plays a central role in recent advances in neural audio codecs. Thanks to the hierarchical structure across quantization levels, these models synthesize high-fidelity audio from a small number of discrete codes. In this paper, we propose an encoding algorithm that further enhances the synthesis quality of RVQ-based neural codecs at test time. First, we point out the suboptimal nature of the quantized vectors produced by conventional methods and demonstrate that the quantization error can be reduced by selecting a different set of codes. We then present our encoding algorithm, which identifies a set of discrete codes achieving a lower quantization error. Finally, we apply the proposed method to pre-trained models and evaluate its efficacy with diverse metrics. Our experimental findings confirm that the method not only reduces quantization error but also improves synthesis quality.
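The suboptimality claim can be made concrete with a minimal NumPy sketch; the codebooks below are random toys with hypothetical dimensions, not those of any real codec. Greedy encoding picks the nearest code at each level independently, whereas an exhaustive joint search over all code combinations, feasible here because the toy codebooks are tiny, can only match or beat the greedy result.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D, K, L = 4, 8, 3                       # latent dim, codebook size, RVQ levels (toy values)
codebooks = rng.normal(size=(L, K, D))  # random toy codebooks (hypothetical)
x = rng.normal(size=D)                  # target latent vector

def greedy_encode(x, codebooks):
    """Conventional RVQ encoding: per-level nearest-code selection on the residual."""
    residual, codes = x.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, float(np.sum(residual ** 2))

def joint_encode(x, codebooks):
    """Brute-force search over all K**L code combinations (tractable only for toys)."""
    best_codes, best_err = None, np.inf
    for codes in itertools.product(range(K), repeat=L):
        approx = sum(cb[i] for cb, i in zip(codebooks, codes))
        err = float(np.sum((x - approx) ** 2))
        if err < best_err:
            best_codes, best_err = list(codes), err
    return best_codes, best_err

g_codes, g_err = greedy_encode(x, codebooks)
j_codes, j_err = joint_encode(x, codebooks)
assert j_err <= g_err  # the joint optimum can only match or beat greedy
```

Whenever the inequality is strict, a different set of codes than the greedy one attains lower quantization error, which is exactly the gap the proposed test-time encoding algorithm aims to close without the exponential cost of exhaustive search.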