🤖 AI Summary
This study compares continuous latent representations and discrete tokens as supervision targets for neural audio codec (NAC)-based speech enhancement. Both autoregressive and non-autoregressive models based on the Conformer architecture are trained end-to-end in the NAC latent space, alongside a baseline that fine-tunes the NAC encoder directly. Key findings are: (1) predicting continuous latent representations consistently outperforms discrete token prediction on quality and intelligibility metrics such as PESQ and STOI; (2) autoregressive models reach higher quality but sacrifice intelligibility and inference efficiency, making non-autoregressive models the better practical trade-off; (3) fine-tuning the NAC encoder yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. Together, these results make the case for continuous latent targets in NAC-driven speech enhancement and outline a practical framework for efficient deployment.
📝 Abstract
Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a baseline where the NAC encoder is directly fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.
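To make the contrast between the two training targets concrete, here is a minimal NumPy sketch (all shapes, the codebook, and the nearest-neighbour tokenization are illustrative assumptions, not the paper's actual NAC or losses): a continuous target is supervised by regressing the clean latent vectors directly (MSE), while a discrete target first quantizes each clean latent frame to its nearest codebook entry and supervises the resulting token id with cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: T latent frames of dimension D, codebook of K entries.
T, D, K = 50, 8, 16
codebook = rng.normal(size=(K, D))          # stand-in for a NAC quantizer codebook
clean_latents = rng.normal(size=(T, D))     # stand-in for clean-speech encoder output
predicted = clean_latents + 0.1 * rng.normal(size=(T, D))  # stand-in model output

# Continuous target: regress the clean latent vectors directly (MSE loss).
mse_loss = np.mean((predicted - clean_latents) ** 2)

# Discrete target: the supervision signal is the codebook index of each frame
# (nearest-neighbour quantization), trained with cross-entropy over K classes.
dists = np.linalg.norm(clean_latents[:, None, :] - codebook[None, :, :], axis=-1)
target_tokens = dists.argmin(axis=1)        # (T,) integer token ids

# Turn negative distances into logits, take a stable log-softmax, then the
# negative log-likelihood of the target token at each frame.
logits = -np.linalg.norm(predicted[:, None, :] - codebook[None, :, :], axis=-1)
m = logits.max(axis=1, keepdims=True)
log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
ce_loss = -log_probs[np.arange(T), target_tokens].mean()

print(f"continuous (MSE) loss: {mse_loss:.4f}")
print(f"discrete (CE) loss:    {ce_loss:.4f}")
```

The sketch only shows the loss formulations; the paper's models replace `predicted` with the output of a Conformer operating on noisy-speech latents, and the discrete variant predicts token logits directly rather than via distances.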