SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs

📅 2025-12-23
🤖 AI Summary
At ultra-low bitrates, neural speech codecs struggle to simultaneously preserve acoustic fidelity and semantic richness. To address this, we propose a semantic-anchored asymmetric dual-quantization architecture. Our method introduces a novel semantic anchoring mechanism that aligns acoustic features with a frozen mHuBERT codebook via a lightweight projector; designs a residual SimVQ acoustic pathway enabling high-fidelity reconstruction under single-layer quantization; and explicitly decouples semantic and acoustic representation learning. Evaluated at 1.5 kbps, our approach achieves new state-of-the-art performance: subjective MOS scores approach those of the original waveform, while downstream semantic capabilities—e.g., ASR accuracy and emotion recognition—show substantial improvement. To our knowledge, this is the first work to achieve synergistic optimization of both fidelity and semantic expressiveness at such an extreme bitrate.

📝 Abstract
Neural speech codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism. This design decouples the quantization of semantic and acoustic details. Semantic anchoring is achieved via a lightweight projector that aligns acoustic features with a frozen, large-scale mHuBERT codebook, injecting linguistic priors while guaranteeing full codebook utilization. Subsequently, for acoustic details, a residual activation module with SimVQ enables a single-layer quantizer (the acoustic path) to faithfully recover fine-grained information. At just 1.5 kbps, SACodec establishes a new state of the art in both fidelity and semantics: subjective listening tests confirm that its reconstruction quality is perceptually highly comparable to ground-truth audio, while its tokens demonstrate substantially improved semantic richness in downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Addresses low-bitrate trade-off between fidelity and semantics
Decouples quantization of semantic and acoustic details
Improves semantic richness and fidelity at 1.5 kbps
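
As a rough illustration of what a 1.5 kbps budget means, the bitrate of a token-based codec is the frame rate multiplied by the bits spent per frame. The sketch below assumes one index per codebook per frame. The function name `bitrate_bps`, the 75 Hz frame rate, and the 1024-entry codebooks are hypothetical values chosen only so the arithmetic lands on 1500 bps; they are not figures reported for SACodec.

```python
import math


def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Bits per second when one index per codebook is emitted per frame."""
    # Each index into a codebook of size K costs log2(K) bits.
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


# Hypothetical configuration (NOT taken from the paper):
# 75 frames per second, two codebooks of 1024 entries each.
print(bitrate_bps(75, [1024, 1024]))  # 1500.0
```

Under these assumed numbers, each frame carries only 20 bits, which is why every codebook entry has to be used efficiently at this bitrate.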
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric dual-quantizer decouples semantic and acoustic quantization
Semantic anchoring aligns features with frozen mHuBERT codebook
Residual activation module enables fine-grained acoustic detail recovery
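
The three points above can be sketched as a two-stage quantizer. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: plain nearest-neighbour lookup stands in for SimVQ, the residual activation module is approximated by a simple subtraction, the residual is assumed to be taken in the projected space, and the names `nearest_code`, `dual_quantize`, and all shapes are hypothetical.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return (indices, entries) of the nearest codebook row for each row of x."""
    # Squared Euclidean distances, shape (frames, codes).
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]


def dual_quantize(features, projector, semantic_codebook, acoustic_codebook):
    """Quantize features with a frozen semantic codebook, then the residual."""
    # Semantic path: a lightweight linear projector maps features into the
    # space of the frozen codebook, which is only read here, never updated.
    projected = features @ projector
    sem_idx, sem_q = nearest_code(projected, semantic_codebook)
    # Acoustic path (assumption): a single quantizer encodes what the
    # semantic anchor left unexplained.
    residual = projected - sem_q
    ac_idx, ac_q = nearest_code(residual, acoustic_codebook)
    return sem_idx, ac_idx, sem_q + ac_q


# Toy usage with random data; sizes are illustrative only.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))             # 5 frames, 8-dim encoder features
proj = rng.normal(size=(8, 4))              # hypothetical projector weights
sem_cb = rng.normal(size=(16, 4))           # stand-in for the frozen codebook
ac_cb = rng.normal(size=(16, 4)) * 0.1      # stand-in for the acoustic codebook
sem_idx, ac_idx, recon = dual_quantize(feats, proj, sem_cb, ac_cb)
print(sem_idx.shape, ac_idx.shape, recon.shape)
```

The asymmetry in this sketch is that only the projector and the acoustic codebook would be trained, while the semantic codebook stays fixed so that its indices keep the meaning they had in the pretrained model.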
Zhongren Dong
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Bin Wang
Beijing Xiaomi Mobile Software Co., Ltd, Beijing, China
Jing Han
University of Cambridge
deep learning, audio signal processing, machine learning, mHealth, affective computing
Haotian Guo
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Xiaojun Mo
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Yimin Cao
College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
Zixing Zhang
Professor, Hunan University
Artificial Intelligence, Speech Processing, Affective Computing, Digital Health, Automatic Speech Recognition