🤖 AI Summary
Existing speech codecs struggle to simultaneously achieve high-fidelity reconstruction and rich semantic representation, limiting their generalization across generative and understanding tasks. This paper proposes SAC, a neural speech codec with semantic-acoustic dual-stream quantization that decouples semantic and acoustic modeling pathways: one stream learns discrete semantic tokens, while the other captures fine-grained acoustic representations, with both jointly optimized end-to-end. The architecture supports robust multi-rate encoding and achieves state-of-the-art reconstruction quality (↑ UTMOS, ↓ WER) in both clean and noisy conditions. Moreover, its semantic representations match the expressiveness of self-supervised continuous embedding models and substantially outperform those of prior codecs. The core innovation lies in the first explicit separation and co-optimization of semantic and acoustic quantization, yielding unified gains in both reconstruction fidelity and high-level semantic expressivity.
📝 Abstract
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high UTMOS scores and low WER, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.
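To make the dual-stream idea concrete, below is a minimal toy sketch of two-stream vector quantization in NumPy. All names, codebook sizes, and dimensions here are hypothetical illustrations, not SAC's actual architecture: one codebook stands in for the semantic stream and a second quantizes the residual as a stand-in for the acoustic stream. Real codecs learn these codebooks end-to-end with encoder/decoder networks; here they are random, so this only shows the token flow, not reconstruction quality.

```python
import numpy as np

rng = np.random.default_rng(0)


def quantize(frames, codebook):
    """Nearest-neighbour vector quantization (L2 distance).

    frames:   (T, D) array of encoder features
    codebook: (K, D) array of codebook entries
    Returns the discrete token per frame and the dequantized vectors.
    """
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)        # (T,) discrete token ids
    return idx, codebook[idx]         # tokens, dequantized (T, D) vectors


# Hypothetical sizes: 50 frames of 64-d features, a 256-entry
# "semantic" codebook and a 1024-entry "acoustic" codebook.
T, D = 50, 64
features = rng.normal(size=(T, D))
semantic_cb = rng.normal(size=(256, D))   # coarse, meaning-oriented stream
acoustic_cb = rng.normal(size=(1024, D))  # fine-grained detail stream

# Stream 1: semantic tokens quantize the features directly.
sem_idx, sem_vec = quantize(features, semantic_cb)

# Stream 2: acoustic tokens quantize the residual the semantic
# stream leaves behind, capturing fine acoustic detail.
aco_idx, aco_vec = quantize(features - sem_vec, acoustic_cb)

# A decoder would reconstruct the waveform from both token streams.
reconstruction = sem_vec + aco_vec
print(sem_idx.shape, aco_idx.shape, reconstruction.shape)
```

With trained (rather than random) codebooks and a neural decoder, the residual structure lets the acoustic stream specialize in fidelity while the semantic stream stays compact and meaning-oriented, which is the division of labor the paper's dual-stream design exploits.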