🤖 AI Summary
Existing diffusion models and Schrödinger Bridge (SB) methods for speech enhancement require ≥50 sampling steps, which makes inference slow; when fewer steps are used, their performance degrades severely, particularly under low SNR. This work proposes the first integration of SB theory with generative adversarial networks (GANs) to construct an end-to-end differentiable, few-step reversible generative framework: SB theory is employed to model the prior distribution, while adversarial training enhances single-step reconstruction fidelity and ensures alignment between the generated and real speech distributions. Evaluated on full-band speech enhancement, the method achieves state-of-the-art performance with only one inference step, outperforming mainstream multi-step diffusion and SB baselines. It yields significant improvements in denoising (PESQ +1.2) and dereverberation (STOI +3.8%), effectively breaking the quality bottleneck inherent in few-step sampling.
📝 Abstract
Deep generative models have recently been employed for speech enhancement to generate perceptually valid clean speech on large-scale datasets. Several diffusion models have been proposed, and more recently, a tractable Schrödinger Bridge has been introduced to transport between the clean and noisy speech distributions. However, these models often suffer from an iterative reverse process and require a large number of sampling steps (more than 50). Our investigation reveals that the performance of baseline models significantly degrades when the number of sampling steps is reduced, particularly under low-SNR conditions. We propose integrating Schrödinger Bridge with GANs to effectively mitigate this issue, achieving high-quality outputs on full-band datasets while substantially reducing the required sampling steps. Experimental results demonstrate that our proposed model outperforms existing baselines, even with a single inference step, in both denoising and dereverberation tasks.