SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization

📅 2025-10-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech codecs struggle to achieve high-fidelity reconstruction and rich semantic representation simultaneously, which limits their generalization across generative and understanding tasks. This paper proposes SAC, a neural speech codec with semantic-acoustic dual-stream quantization, which decouples the semantic and acoustic modeling pathways: one stream learns discrete semantic tokens while the other captures fine-grained acoustic detail, with both jointly optimized end-to-end. The architecture enables robust multi-rate encoding and achieves state-of-the-art reconstruction quality (↑ UTMOS, ↓ WER) in both clean and noisy conditions. Moreover, its semantic representations match the expressiveness of self-supervised continuous embedding models and substantially outperform prior codecs. The core innovation is the first explicit separation and co-optimization of semantic and acoustic quantization, enabling unified gains in both reconstruction fidelity and high-level semantic expressivity.

📝 Abstract
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.
Problem

Research questions and friction points this paper is trying to address.

Balancing speech reconstruction quality with semantic richness
Disentangling semantic and acoustic modeling for optimization
Improving codec performance across diverse bitrates and conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream quantization separates semantic and acoustic modeling
Achieves high reconstruction quality across diverse bitrates
Outperforms existing codecs in semantic representation capabilities
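The dual-stream idea described above can be illustrated with a minimal sketch. The paper does not publish this code; the snippet below is a hypothetical toy model of the core mechanism only: the same input features are quantized against two independent codebooks, one standing in for the semantic stream and one for the acoustic stream (in SAC proper, each stream has its own dedicated encoding pathway and the two are trained jointly end-to-end).

```python
import numpy as np

def nearest_code(frames, codebook):
    """Assign each frame to its nearest codebook entry (squared L2 distance).

    Returns the discrete token indices and the corresponding quantized vectors.
    """
    # dists[i, j] = ||frames[i] - codebook[j]||^2, via broadcasting
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def dual_stream_quantize(frames, sem_codebook, ac_codebook):
    """Toy dual-stream quantization: two independent token streams.

    The semantic stream would carry content-level tokens, the acoustic
    stream fine-grained detail; a decoder (not shown) would consume both
    streams to reconstruct the waveform. Codebook sizes are illustrative.
    """
    sem_idx, sem_q = nearest_code(frames, sem_codebook)
    ac_idx, ac_q = nearest_code(frames, ac_codebook)
    # Two parallel discrete streams plus the combined quantized features.
    return (sem_idx, ac_idx), np.concatenate([sem_q, ac_q], axis=-1)

# Example: 5 frames of 8-dim features, 16-entry semantic codebook,
# 32-entry acoustic codebook (all values random for illustration).
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))
sem_cb = rng.normal(size=(16, 8))
ac_cb = rng.normal(size=(32, 8))
(sem_tokens, ac_tokens), quantized = dual_stream_quantize(frames, sem_cb, ac_cb)
```

Because each stream has its own codebook, the two can be sized and optimized separately, which is the property the paper exploits to trade off bitrate against semantic richness.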