AI Summary
Existing neural speech codecs retain substantial temporal redundancy in acoustic or linguistic features, limiting coding efficiency. To address this, we propose TF-Codec, a VQ-VAE-based framework that incorporates implicit temporal predictive coding, where feature encoding is conditioned on historically quantized latent variables; a learnable time-frequency adaptive compression module; and a differentiable vector quantization scheme that fuses distance-based soft mapping with Gumbel-Softmax to jointly optimize the rate-distortion trade-off and latent distribution modeling. Our key contributions are: (1) the first implicit, latent-space conditional predictive coding mechanism; (2) joint learnable time-frequency compression; and (3) high-fidelity differentiable quantization. Experiments show that TF-Codec at only 1 kbps significantly outperforms Opus at 9 kbps, and at 3 kbps surpasses both EVS at 9.6 kbps and Opus at 12 kbps. Multilingual subjective evaluations confirm low latency (<20 ms) and high-quality speech reconstruction.
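The predictive coding idea above can be illustrated with a minimal sketch. This is not the paper's learned, implicit conditioning: a mean-of-history predictor and a uniform scalar quantizer stand in for the trained predictor and vector quantizer, purely to show why conditioning on *quantized* history lets encoder and decoder stay in sync while only the residual is coded.

```python
import numpy as np

def encode_sequence(latents, step=0.5, order=2):
    """Simplified residual predictive coding in the latent domain.

    Each frame is predicted from previously QUANTIZED frames (here a mean
    of the last `order` reconstructions stands in for a learned predictor),
    and only the prediction residual is quantized. Because the decoder sees
    the same quantized history, it can form the identical prediction.
    """
    history, coded = [], []
    for z in latents:
        # Predict from quantized history, not from the raw latents.
        pred = np.mean(history[-order:], axis=0) if history else np.zeros_like(z)
        residual = z - pred
        # Uniform scalar quantizer as a stand-in for the paper's VQ.
        q_res = np.round(residual / step) * step
        z_hat = pred + q_res            # decoder-side reconstruction
        history.append(z_hat)
        coded.append(q_res)
    return np.stack(coded), np.stack(history)
```

For temporally correlated latents the residuals carry much less energy than the raw frames, which is the redundancy the coded bitstream no longer has to spend bits on; the per-element reconstruction error stays bounded by half the quantizer step.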
Abstract
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blind features learned with a convolutional neural network, so temporal redundancies remain within the encoded features. This article introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
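The distance-to-soft mapping with Gumbel-Softmax can be sketched as follows. This is a minimal, self-contained approximation, not the paper's implementation: it maps negative squared distances to codewords into logits, perturbs them with Gumbel noise, and forms a temperature-controlled soft assignment, so the quantization step stays differentiable during training. The function name, codebook shape, and temperature schedule are illustrative assumptions.

```python
import numpy as np

def soft_vector_quantize(z, codebook, tau=1.0, rng=None):
    """Differentiable VQ sketch: distance-based soft mapping + Gumbel-Softmax.

    z:        (D,) latent vector
    codebook: (K, D) codewords
    tau:      softmax temperature; assignments harden as tau -> 0
    """
    rng = rng or np.random.default_rng()
    # Squared Euclidean distance from z to every codeword.
    d2 = np.sum((codebook - z) ** 2, axis=1)            # (K,)
    logits = -d2                                        # closer codeword -> larger logit
    # Gumbel noise turns sampling into a differentiable softmax relaxation.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    x = (logits + gumbel) / tau
    x -= x.max()                                        # numerical stability
    probs = np.exp(x)
    probs /= probs.sum()
    # Soft codeword: convex combination of the codebook rows.
    return probs @ codebook, probs
```

In an actual training loop the soft assignment probabilities would also feed a rate term (e.g. an entropy estimate over codeword usage), which is how the scheme couples distribution modeling with the rate constraint described in the abstract.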