High-Fidelity Speech Enhancement via Discrete Audio Tokens

📅 2025-10-02

🤖 AI Summary

Existing autoregressive speech enhancement methods rely on multi-stage pipelines and low-sampling-rate codecs, which limits both fidelity and task generality. This paper proposes DAC-SE1, the first framework to feed fine-grained audio tokens from a high-resolution discrete audio codec (DAC) directly into an autoregressive Transformer, enabling end-to-end, single-stage speech enhancement. DAC-SE1 models perceptual quality restoration and semantic consistency jointly within one architecture, which simplifies the design and improves scalability. Experiments show that DAC-SE1 surpasses state-of-the-art autoregressive methods on objective metrics, including PESQ and STOI, and in MUSHRA subjective listening tests, with gains in speech intelligibility, naturalness, and fidelity.

📝 Abstract

Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low sampling rate codecs, limiting them to narrow and task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language model-based SE framework leveraging discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods on both objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.

Problem

Research questions and friction points this paper is trying to address.

Enhancing speech quality using discrete audio tokens

Overcoming limitations of complex multi-stage SE pipelines

Preserving acoustic details while maintaining semantic coherence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages discrete high-resolution audio representations

Simplifies the language-model-based speech enhancement framework

Preserves fine-grained acoustic details and semantic coherence
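
To make the idea of feeding codec tokens to an autoregressive model more concrete, here is a minimal sketch of one common way to turn multi-codebook codec output into a single token stream. This page does not describe DAC-SE1's actual token layout, so the frame-major interleaving, the per-codebook ID offsets, and the function names `flatten_codes` and `unflatten_tokens` are all illustrative assumptions, not the authors' method.

```python
from typing import List


def flatten_codes(codes: List[List[int]], codebook_size: int) -> List[int]:
    """Interleave codec indices frame by frame into one token sequence.

    codes[q][t] is the index picked by codebook q at frame t (assumed layout).
    Each codebook is shifted into its own ID range so a language model
    can distinguish quantizer levels that share the same raw index.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    tokens = []
    for t in range(num_frames):
        for q in range(num_codebooks):
            tokens.append(q * codebook_size + codes[q][t])
    return tokens


def unflatten_tokens(
    tokens: List[int], num_codebooks: int, codebook_size: int
) -> List[List[int]]:
    """Invert flatten_codes so the codec decoder can consume the result."""
    if len(tokens) % num_codebooks != 0:
        raise ValueError("token count must be a multiple of num_codebooks")
    num_frames = len(tokens) // num_codebooks
    codes = [[0] * num_frames for _ in range(num_codebooks)]
    for i, tok in enumerate(tokens):
        t, q = divmod(i, num_codebooks)
        codes[q][t] = tok - q * codebook_size
    return codes


# Toy example: 2 codebooks, 2 frames, 4 entries per codebook (hypothetical sizes).
toy_codes = [[3, 1], [0, 2]]
flat = flatten_codes(toy_codes, codebook_size=4)
restored = unflatten_tokens(flat, num_codebooks=2, codebook_size=4)
```

Under this layout the sequence length grows as frames times codebooks, which is why a high-resolution codec produces much longer sequences than a low-sampling-rate one and why long-context modeling matters for this kind of approach.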
