Speech Enhancement Using Continuous Embeddings of Neural Audio Codec

📅 2025-02-22
🤖 AI Summary
To address high latency and computational overhead in cloud-based speech enhancement (SE), this work proposes performing SE directly within the pre-quantized continuous embedding space of neural audio codecs (NACs), bypassing conventional discrete token modeling and language-model-based paradigms. Methodologically, we leverage a pretrained NAC encoder to extract continuous temporal embeddings, introduce an embedding-level supervision loss, and employ a lightweight temporal modeling network for end-to-end optimization. Our key contribution is the first adaptation of SE to the pre-quantized continuous representation domain of NACs, enabling both high-fidelity reconstruction and ultra-low inference latency. Experiments demonstrate a real-time factor of only 0.005, 3.94 GMACs, and an 18× reduction in computational complexity over SepFormer, while matching the performance of strong baselines trained on large-scale datasets.
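To put the reported efficiency figures in concrete terms, the following minimal sketch works through the arithmetic. It assumes the standard definition of real-time factor (processing time divided by the duration of the audio processed). The function names are illustrative, and the baseline cost it prints is only what the 18× figure implies if taken at face value, not a number reported in the summary.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(rtf: float, audio_seconds: float) -> float:
    """Invert the RTF definition: seconds of compute for a given clip length."""
    return rtf * audio_seconds


def implied_baseline_gmacs(model_gmacs: float, reduction_factor: float) -> float:
    """Cost a baseline would need for the stated reduction factor to hold."""
    return model_gmacs * reduction_factor


# Figures quoted in the summary above.
REPORTED_RTF = 0.005
REPORTED_GMACS = 3.94
REPORTED_REDUCTION = 18  # "18x reduction ... over SepFormer"

seconds_for_10s = processing_time(REPORTED_RTF, 10.0)
baseline = implied_baseline_gmacs(REPORTED_GMACS, REPORTED_REDUCTION)

print(f"{seconds_for_10s:.3f} s of compute for 10 s of audio")
print(f"implied baseline cost: about {baseline:.1f} GMACs")
```

Under these assumptions, an RTF of 0.005 corresponds to roughly 0.05 s of compute per 10 s of audio. The summary does not state the input length over which GMACs are counted, so the implied baseline figure should be read as an order-of-magnitude check only.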

๐Ÿ“ Abstract
Recent advancements in Neural Audio Codec (NAC) models have inspired their use in various speech processing tasks, including speech enhancement (SE). In this work, we propose a novel, efficient SE approach by leveraging the pre-quantization output of a pretrained NAC encoder. Unlike prior NAC-based SE methods, which process discrete speech tokens using Language Models (LMs), we perform SE within the continuous embedding space of the pretrained NAC, which is highly compressed along the time dimension for efficient representation. Our lightweight SE model, optimized through an embedding-level loss, delivers results comparable to SE baselines trained on larger datasets, with a significantly lower real-time factor of 0.005. Additionally, our method requires only 3.94 GMACs, reducing complexity 18-fold compared to SepFormer in a simulated cloud-based audio transmission environment. This work highlights a new, efficient NAC-based SE solution, particularly suitable for cloud applications where NAC is used to compress audio before transmission.
Problem

Research questions and friction points this paper is trying to address.

Enhance speech using continuous embeddings
Reduce complexity in audio processing
Optimize for cloud-based audio transmission
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes pre-quantization NAC encoder output
Performs SE in continuous embedding space
Optimizes lightweight model with embedding-level loss
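
The three points above can be illustrated with a toy, standard-library-only sketch. Everything in it is a hypothetical stand-in: `toy_encoder` replaces the frozen pretrained NAC encoder, a single scalar gain replaces the lightweight temporal network, and mean squared error is an assumed form of the embedding-level loss (the source names the loss but does not specify it). The point is only the data flow: both clean and noisy audio pass through the same frozen encoder, and the enhancer is fitted to move noisy embeddings toward clean ones.

```python
import random


def toy_encoder(wave, hop=4):
    """Hypothetical frozen encoder: one 2-D embedding (mean, mean absolute
    value) per `hop` samples, so the time axis is compressed by `hop`."""
    frames = [wave[i:i + hop] for i in range(0, len(wave) - hop + 1, hop)]
    return [[sum(f) / hop, sum(abs(x) for x in f) / hop] for f in frames]


def embedding_mse(enhanced, target):
    """Mean squared error over all frames and embedding dimensions."""
    diffs = [(e - t) ** 2
             for fe, ft in zip(enhanced, target)
             for e, t in zip(fe, ft)]
    return sum(diffs) / len(diffs)


def apply_gain(embeddings, gain):
    """Deliberately tiny 'enhancer': one scalar gain on every value."""
    return [[gain * v for v in frame] for frame in embeddings]


def best_gain(noisy_emb, clean_emb):
    """Closed-form least-squares gain that minimises embedding_mse."""
    num = sum(n * c
              for fn, fc in zip(noisy_emb, clean_emb)
              for n, c in zip(fn, fc))
    den = sum(n * n for fn in noisy_emb for n in fn)
    return num / den


rng = random.Random(0)
clean = [((i % 8) - 3.5) / 4 for i in range(64)]   # synthetic "speech"
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]   # additive noise

z_clean = toy_encoder(clean)   # supervision target in embedding space
z_noisy = toy_encoder(noisy)   # input to the enhancer

gain = best_gain(z_noisy, z_clean)
loss_before = embedding_mse(z_noisy, z_clean)
loss_after = embedding_mse(apply_gain(z_noisy, gain), z_clean)

print(len(clean), "samples ->", len(z_clean), "embedding frames")
print("embedding loss before/after fitting:", loss_before, loss_after)
```

Because the enhancer operates on the compressed frame sequence rather than on raw samples, the amount of data it touches shrinks by the encoder's hop factor, which is the intuition behind the low compute cost claimed for the approach.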
Haoyang Li
Nanyang Technological University, Singapore
Jia Qi Yip
Menlo Research
signal processing, speech separation, speaker verification
Tianyu Fan
The University of Hong Kong
DeepResearch, LLM, agent
Eng Siong Chng
Nanyang Technological University, Singapore