Latent-Domain Predictive Neural Speech Coding

📅 2022-07-18
🏛️ IEEE/ACM Transactions on Audio Speech and Language Processing
📈 Citations: 19
✨ Influential: 1
🤖 AI Summary
Existing neural speech codecs retain substantial temporal redundancy in their acoustic or learned features, limiting coding efficiency. To address this, the paper proposes TF-Codec, a VQ-VAE-based framework with three components: (1) implicit latent-domain predictive coding, where feature encoding is conditioned on a prediction from past quantized latent frames so that temporal correlations are removed; (2) a learnable time-frequency compression module that adaptively adjusts the attention paid to main frequencies and details at different bitrates; and (3) a differentiable vector quantization scheme fusing distance-based soft mapping with Gumbel-Softmax, jointly optimizing the rate-distortion trade-off and latent distribution modeling. Experiments show that TF-Codec at only 1 kbps significantly outperforms Opus at 9 kbps, and at 3 kbps surpasses both EVS at 9.6 kbps and Opus at 12 kbps. Multilingual subjective evaluations confirm high-quality speech reconstruction at low latency (<20 ms).
📝 Abstract
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs employ either acoustic features or learned blind features with a convolutional neural network for encoding, by which there are still temporal redundancies within encoded features. This article introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions with rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
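The differentiable quantization the abstract describes maps distances to codewords into soft assignment probabilities and perturbs them with Gumbel noise. A minimal numpy sketch of that idea follows; the function name, the squared-Euclidean distance, and the temperature handling are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def soft_vq(z, codebook, tau=1.0, rng=None):
    """Differentiable VQ sketch: distance-to-soft mapping plus
    Gumbel-Softmax. Closer codewords get larger logits; Gumbel noise
    makes the discrete choice stochastic, and the temperature tau
    controls how hard the assignment is.

    z        : (T, D) latent frames
    codebook : (K, D) codewords
    returns  : (T, D) soft-quantized latents, (T, K) assignment probs
    """
    rng = rng or np.random.default_rng(0)
    # negative squared distance to each codeword acts as the logit
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    # Gumbel noise: -log(-log(U)), U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y -= y.max(axis=-1, keepdims=True)          # numerical stability
    probs = np.exp(y) / np.exp(y).sum(axis=-1, keepdims=True)
    # soft codeword: convex combination; hardens as tau -> 0
    return probs @ codebook, probs
```

Annealing `tau` toward zero during training makes the soft assignment approach a hard nearest-codeword lookup while keeping gradients usable throughout.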
Problem

Research questions and friction points this paper is trying to address.

Temporal redundancy remains in the features encoded by existing neural codecs
Fixed input representations cannot adapt attention across frequencies at different bitrates
High speech quality at very low bitrates is hard to achieve while keeping latency low
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent-domain predictive coding removes temporal redundancies
Learnable time-frequency compression adapts to bitrates
Differentiable vector quantization models latent distributions
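The latent-domain predictive coding idea above can be sketched as a closed loop: each latent frame is coded relative to a prediction formed from already-quantized history, so only the unpredictable component consumes bits. This explicit-residual variant is a simplification for illustration (the paper conditions the encoding implicitly); `predictor` and `quantize` are stand-ins for the learned modules.

```python
import numpy as np

def predictive_code(latents, predictor, quantize):
    """Closed-loop latent-domain predictive coding sketch.
    latents   : (T, D) array of latent frames
    predictor : maps the list of past quantized frames to a (D,) prediction
    quantize  : quantizes a (D,) residual (stands in for the learned VQ)
    Returns the quantized residuals (what would be transmitted) and the
    reconstructed latents, which the decoder can mirror frame by frame.
    """
    history, residuals, recon = [], [], []
    for z in latents:
        pred = predictor(history)      # prediction from *quantized* past
        r_q = quantize(z - pred)       # only the residual is coded
        z_hat = pred + r_q             # feed the reconstruction back
        history.append(z_hat)
        residuals.append(r_q)
        recon.append(z_hat)
    return np.stack(residuals), np.stack(recon)
```

Because the predictor sees quantized frames, encoder and decoder stay in sync: the decoder forms the same predictions and adds the received residuals back.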
Xue Jiang
School of Information and Communication Engineering, Communication University of China, Beijing 100024, China
Xiulian Peng
Researcher at Microsoft Research Asia
deep learning, audio and speech, computer vision, real-time communication, image/video coding
Huaying Xue
Microsoft Research Asia, Beijing 100080, China
Yuan Zhang
State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Yan Lu
Microsoft Research Asia, Beijing 100080, China