ZipEnhancer: Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement

📅 2025-01-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Dual-Path time and time-frequency domain models for single-channel speech enhancement are effective and parameter-efficient, but their four-axis hidden features make them computationally expensive. To reduce this cost, the paper proposes ZipEnhancer, a Dual-Path Zipformer that applies Down-Up sampling along both the time and frequency axes. Key contributions include: (i) Dual-Path DownSampleStacks that symmetrically scale the hidden features down and back up; (ii) the ZipformerBlock as the core building block; and (iii) the ScaleAdam optimizer coupled with the Eden learning rate scheduler to further improve performance. Evaluated on the DNS 2020 Challenge and VoiceBank+DEMAND datasets, ZipEnhancer achieves state-of-the-art PESQ scores of 3.69 and 3.63, respectively, with 2.04M parameters and 62.41G FLOPS, outperforming other methods of similar complexity.

📝 Abstract

In contrast to other sequence tasks, which model hidden layer features with three axes, Dual-Path time and time-frequency domain speech enhancement models are effective and have low parameter counts but are computationally demanding because their hidden layer features have four axes. We propose ZipEnhancer, a Dual-Path Down-Up Sampling-based Zipformer for Monaural Speech Enhancement, which incorporates time and frequency domain Down-Up sampling to reduce computational costs. We introduce the ZipformerBlock as the core block and propose the design of the Dual-Path DownSampleStacks, which symmetrically scale down and scale up. We also introduce the ScaleAdam optimizer and the Eden learning rate scheduler to further improve performance. Our model achieves new state-of-the-art results on the DNS 2020 Challenge and VoiceBank+DEMAND datasets, with perceptual evaluation of speech quality (PESQ) scores of 3.69 and 3.63, respectively, using 2.04M parameters and 62.41G FLOPS, outperforming other methods with similar complexity levels.
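
To make the cost argument concrete, the following is a minimal NumPy sketch of symmetric down-up sampling on a four-axis feature tensor. It is an illustration under stated assumptions, not the paper's implementation: the axis layout (batch, channels, time, frequency), the tensor sizes, the average-pooling downsampler, the repeat-based upsampler, and the function names `downsample`, `upsample`, and `down_up` are all hypothetical. The actual DownSampleStacks may use different, learned operators.

```python
import numpy as np

# Assumed axis layout for the four-axis hidden features:
# (batch, channels, time, frequency). This layout is an assumption.
T_AXIS, F_AXIS = 2, 3


def downsample(x, factor, axis):
    """Average-pool `x` by `factor` along a non-negative `axis`."""
    n = x.shape[axis]
    if n % factor != 0:
        raise ValueError("axis length must be divisible by factor")
    # Split the axis into (n // factor, factor) and average over the new sub-axis.
    new_shape = x.shape[:axis] + (n // factor, factor) + x.shape[axis + 1:]
    return x.reshape(new_shape).mean(axis=axis + 1)


def upsample(x, factor, axis):
    """Restore the original length by repeating each element `factor` times."""
    return np.repeat(x, factor, axis=axis)


def down_up(x, t_factor, f_factor):
    """Scale down in time and frequency, then symmetrically scale back up."""
    low = downsample(downsample(x, t_factor, T_AXIS), f_factor, F_AXIS)
    # In a real model, the expensive sequence blocks would run here on `low`.
    restored = upsample(upsample(low, f_factor, F_AXIS), t_factor, T_AXIS)
    return low, restored


# Hypothetical sizes: 1 utterance, 64 channels, 100 frames, 80 frequency bins.
x = np.ones((1, 64, 100, 80))
low, restored = down_up(x, t_factor=2, f_factor=2)
print(x.shape, low.shape, restored.shape)
print("elements processed at low resolution:", low.size, "of", x.size)
```

With a factor of 2 on each axis, the inner blocks in this sketch see a quarter of the elements, while the output returns to the input shape. Any saving in a real model depends on how the inner blocks scale with sequence length.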

Problem

Research questions and friction points this paper is trying to address.

Speech Enhancement

High Computational Cost

Monaural Audio Quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

ZipEnhancer

Zipformer

Dual-path Sampling
