AI Summary
Existing neural speech codecs retain substantial temporal redundancy in acoustic or linguistic features, limiting coding efficiency. To address this, we propose TF-Codec, a VQ-VAE-based framework that incorporates implicit temporal predictive coding, where feature encoding is conditioned on historically quantized latent variables; a learnable time-frequency adaptive compression module; and a differentiable vector quantization scheme that fuses distance-based soft mapping with Gumbel-Softmax to jointly optimize the rate-distortion trade-off and latent distribution modeling. Our key contributions are: (1) the first implicit, latent-space conditional predictive coding mechanism; (2) joint learnable time-frequency compression; and (3) high-fidelity differentiable quantization. Experiments show that TF-Codec at only 1 kbps significantly outperforms Opus at 9 kbps, and at 3 kbps surpasses both EVS at 9.6 kbps and Opus at 12 kbps. Multilingual subjective evaluations confirm low latency (<20 ms) and high-quality speech reconstruction.
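The predictive coding idea above can be illustrated with a minimal sketch. This is not the paper's learned, implicit conditioning: a mean-of-history predictor and a uniform scalar quantizer stand in for the trained predictor and vector quantizer, purely to show why conditioning on *quantized* history lets encoder and decoder stay in sync while only the residual is coded.

```python
import numpy as np

def encode_sequence(latents, step=0.5, order=2):
    """Simplified residual predictive coding in the latent domain.

    Each frame is predicted from previously QUANTIZED frames (here a mean
    of the last `order` reconstructions stands in for a learned predictor),
    and only the prediction residual is quantized. Because the decoder sees
    the same quantized history, it can form the identical prediction.
    """
    history, coded = [], []
    for z in latents:
        # Predict from quantized history, not from the raw latents.
        pred = np.mean(history[-order:], axis=0) if history else np.zeros_like(z)
        residual = z - pred
        # Uniform scalar quantizer as a stand-in for the paper's VQ.
        q_res = np.round(residual / step) * step
        z_hat = pred + q_res            # decoder-side reconstruction
        history.append(z_hat)
        coded.append(q_res)
    return np.stack(coded), np.stack(history)
```

For temporally correlated latents the residuals carry much less energy than the raw frames, which is the redundancy the coded bitstream no longer has to spend bits on; the per-element reconstruction error stays bounded by half the quantizer step.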
Abstract
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blind features learned with a convolutional neural network, so temporal redundancies remain within the encoded features. This article introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Numerous studies are conducted to demonstrate the effectiveness of these techniques.
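The distance-to-soft mapping with Gumbel-Softmax can be sketched as follows. This is a minimal, self-contained approximation, not the paper's implementation: it maps negative squared distances to codewords into logits, perturbs them with Gumbel noise, and forms a temperature-controlled soft assignment, so the quantization step stays differentiable during training. The function name, codebook shape, and temperature schedule are illustrative assumptions.

```python
import numpy as np

def soft_vector_quantize(z, codebook, tau=1.0, rng=None):
    """Differentiable VQ sketch: distance-based soft mapping + Gumbel-Softmax.

    z:        (D,) latent vector
    codebook: (K, D) codewords
    tau:      softmax temperature; assignments harden as tau -> 0
    """
    rng = rng or np.random.default_rng()
    # Squared Euclidean distance from z to every codeword.
    d2 = np.sum((codebook - z) ** 2, axis=1)            # (K,)
    logits = -d2                                        # closer codeword -> larger logit
    # Gumbel noise turns sampling into a differentiable softmax relaxation.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    x = (logits + gumbel) / tau
    x -= x.max()                                        # numerical stability
    probs = np.exp(x)
    probs /= probs.sum()
    # Soft codeword: convex combination of the codebook rows.
    return probs @ codebook, probs
```

In an actual training loop the soft assignment probabilities would also feed a rate term (e.g. an entropy estimate over codeword usage), which is how the scheme couples distribution modeling with the rate constraint described in the abstract.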