AI Summary
Consumer-device voice recordings frequently suffer from multiple concurrent degradations, including noise, reverberation, bandwidth limitation, and clipping. This paper proposes an efficient end-to-end speech restoration method that performs single-stage joint modeling in the complex-valued STFT domain, incorporates a phase-aware loss, and supports large analysis windows to improve frequency resolution. Its lightweight neural architecture achieves 10.5× real-time inference on an iPhone 12 CPU with sub-10 ms latency. Key contributions include: (1) EDB, the first open-source benchmark dataset targeting extreme degradation scenarios; (2) state-of-the-art performance on the DNS 5 blind test set, surpassing strong GAN-based baselines and approaching flow-matching methods; and (3) significant improvements over all open-source models on EDB, matching the quality of commercial systems.
Abstract
Vocal recordings on consumer devices commonly suffer from multiple concurrent degradations: noise, reverberation, band-limiting, and clipping. We present Smule Renaissance Small (SRS), a compact single-stage model that performs end-to-end vocal restoration directly in the complex STFT domain. By incorporating phase-aware losses, SRS enables large analysis windows for improved frequency resolution while achieving 10.5× real-time inference on an iPhone 12 CPU at 48 kHz. On the DNS 5 Challenge blind set, despite no training on speech, SRS outperforms a strong GAN baseline and closely matches a computationally expensive flow-matching system. To enable evaluation under realistic multi-degradation scenarios, we introduce the Extreme Degradation Bench (EDB): 87 singing and speech recordings captured under severe acoustic conditions. On EDB, SRS surpasses all open-source baselines on singing and matches commercial systems, while remaining competitive on speech despite no speech-specific training. We release both SRS and EDB under the MIT License.
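To make the "phase-aware loss in the complex STFT domain" idea concrete, here is a minimal numpy sketch. It is not the paper's actual loss or window configuration; the window/hop sizes, the Hann window, and the magnitude-plus-complex combination are illustrative assumptions. The key point it demonstrates is that a pure magnitude loss is blind to phase errors, while a complex-domain term penalizes them:

```python
import numpy as np

def stft(x, win_len=2048, hop=512):
    """Frame the signal with a Hann window and take an FFT per frame.
    Larger windows (e.g. 2048 samples at 48 kHz) give finer frequency
    resolution, at the cost of time resolution."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)  # complex spectrogram

def phase_aware_loss(ref, est, alpha=0.5):
    """Blend a magnitude term (insensitive to phase) with a
    complex-domain term that also penalizes phase mismatch."""
    S_ref, S_est = stft(ref), stft(est)
    mag_term = np.mean(np.abs(np.abs(S_ref) - np.abs(S_est)))
    complex_term = np.mean(np.abs(S_ref - S_est))
    return alpha * mag_term + (1 - alpha) * complex_term
```

A quarter-period phase shift of a pure tone leaves the magnitude term near zero, but the complex term catches it, which is why phase-aware training can exploit large analysis windows without sacrificing waveform fidelity.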