🤖 AI Summary
Existing autoregressive speech enhancement methods rely on multi-stage pipelines and low-sampling-rate codecs, which limits both their fidelity and their task generality. This paper proposes DAC-SE1, the first framework to feed fine-grained audio tokens from a high-resolution discrete audio codec directly into an autoregressive Transformer, enabling end-to-end, single-stage, high-quality speech enhancement. DAC-SE1 jointly models perceptual quality restoration and semantic consistency within a unified architecture, which simplifies the design and improves scalability. Experiments show that DAC-SE1 surpasses state-of-the-art autoregressive methods on objective metrics, including PESQ and STOI, and in MUSHRA subjective listening tests, with improvements in speech intelligibility, naturalness, and fidelity.
📝 Abstract
Recent autoregressive transformer-based speech enhancement (SE) methods have shown promising results by leveraging advanced semantic understanding and contextual modeling of speech. However, these approaches often rely on complex multi-stage pipelines and low-sampling-rate codecs, which limits them to narrow, task-specific speech enhancement. In this work, we introduce DAC-SE1, a simplified language-model-based SE framework that leverages discrete high-resolution audio representations; DAC-SE1 preserves fine-grained acoustic details while maintaining semantic coherence. Our experiments show that DAC-SE1 surpasses state-of-the-art autoregressive SE methods both on objective perceptual metrics and in a MUSHRA human evaluation. We release our codebase and model checkpoints to support further research in scalable, unified, and high-quality speech enhancement.