Smule Renaissance Small: Efficient General-Purpose Vocal Restoration

Date: 2025-10-24
AI Summary
Consumer-device voice recordings frequently suffer from multiple concurrent degradations, including noise, reverberation, bandwidth limitation, and clipping. This paper proposes an efficient end-to-end vocal restoration method that performs single-stage joint modeling in the complex-valued STFT domain, incorporates a phase-aware loss, and supports large analysis windows to enhance frequency resolution. Its lightweight neural architecture achieves 10.5× real-time inference on an iPhone 12 CPU with sub-10 ms latency. Key contributions include: (1) EDB, the first open-source benchmark dataset targeting extreme degradation scenarios; (2) results on the DNS 5 blind test set that surpass a strong GAN-based baseline and approach a flow-matching system, despite no speech training; and (3) improvements over all open-source baselines on EDB singing, matching the quality of commercial systems.

πŸ“ Abstract
Vocal recordings on consumer devices commonly suffer from multiple concurrent degradations: noise, reverberation, band-limiting, and clipping. We present Smule Renaissance Small (SRS), a compact single-stage model that performs end-to-end vocal restoration directly in the complex STFT domain. By incorporating phase-aware losses, SRS enables large analysis windows for improved frequency resolution while achieving 10.5x real-time inference on iPhone 12 CPU at 48 kHz. On the DNS 5 Challenge blind set, despite no speech training, SRS outperforms a strong GAN baseline and closely matches a computationally expensive flow-matching system. To enable evaluation under realistic multi-degradation scenarios, we introduce the Extreme Degradation Bench (EDB): 87 singing and speech recordings captured under severe acoustic conditions. On EDB, SRS surpasses all open-source baselines on singing and matches commercial systems, while remaining competitive on speech despite no speech-specific training. We release both SRS and EDB under the MIT License.
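
The abstract links larger analysis windows to finer frequency resolution and quotes a 10.5x real-time factor at 48 kHz. The arithmetic behind both can be sketched as follows. The window sizes and the 60-second clip are illustrative assumptions, not values reported by the authors, and the function names are hypothetical.

```python
# Illustrative arithmetic only: window sizes and clip length are assumptions.

def bin_spacing_hz(sample_rate_hz: float, window_size: int) -> float:
    """Spacing between adjacent STFT frequency bins for a given window size."""
    return sample_rate_hz / window_size


def window_duration_ms(sample_rate_hz: float, window_size: int) -> float:
    """Length of one analysis window in milliseconds."""
    return 1000.0 * window_size / sample_rate_hz


def processing_time_s(audio_duration_s: float, real_time_factor: float) -> float:
    """Compute time needed when a model runs real_time_factor times faster than real time."""
    return audio_duration_s / real_time_factor


SAMPLE_RATE_HZ = 48_000   # sample rate stated in the abstract
REAL_TIME_FACTOR = 10.5   # real-time factor stated in the abstract

if __name__ == "__main__":
    for window_size in (512, 2048, 4096):  # hypothetical window sizes
        print(
            window_size,
            bin_spacing_hz(SAMPLE_RATE_HZ, window_size),
            window_duration_ms(SAMPLE_RATE_HZ, window_size),
        )
    # A hypothetical 60-second recording at the quoted real-time factor.
    print(processing_time_s(60.0, REAL_TIME_FACTOR))
```

Doubling the window size halves the bin spacing, at the cost of a longer window in time, which is the trade-off the abstract refers to.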
Problem

Research questions and friction points this paper is trying to address.

Restoring vocal recordings with multiple concurrent degradations
Developing an efficient single-stage model for vocal restoration
Enhancing performance under realistic multi-degradation scenarios
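
To make "multiple concurrent degradations" concrete, the sketch below chains toy versions of the four degradations named in the abstract: reverberation, noise, band-limiting, and clipping. It is not the authors' data pipeline. Every function, parameter value, and the ordering of the stages is an assumption chosen for illustration, using only the Python standard library.

```python
import math
import random


def add_reverb(x, decay=0.5, delay=200, taps=4):
    """Toy reverb: add a few decaying echoes (a sparse impulse response)."""
    y = list(x)
    for k in range(1, taps + 1):
        gain, offset = decay ** k, k * delay
        for n in range(offset, len(x)):
            y[n] += gain * x[n - offset]
    return y


def add_noise(x, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    signal_power = sum(s * s for s in x) / len(x)
    sigma = math.sqrt(signal_power / (10 ** (snr_db / 10)))
    return [s + rng.gauss(0.0, sigma) for s in x]


def band_limit(x, alpha=0.2):
    """Toy band-limiting: one-pole low-pass filter."""
    y, prev = [], 0.0
    for s in x:
        prev += alpha * (s - prev)
        y.append(prev)
    return y


def clip(x, threshold=0.5):
    """Hard clipping to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in x]


def degrade(x, seed=0):
    """Apply all four degradations in one (arbitrary) order."""
    rng = random.Random(seed)
    return clip(band_limit(add_noise(add_reverb(x), 20.0, rng)), 0.5)


if __name__ == "__main__":
    # A 440 Hz test tone at 48 kHz stands in for a clean vocal.
    clean = [math.sin(2 * math.pi * 440 * n / 48_000) for n in range(4800)]
    degraded = degrade(clean)
    print(len(degraded), max(abs(s) for s in degraded))
```

A restoration model in this setting receives only the degraded signal and must undo all stages jointly, which is what distinguishes the single-stage formulation from a cascade of specialized modules.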
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact single-stage model for vocal restoration
Incorporates phase-aware losses in complex STFT domain
Achieves real-time inference on mobile CPU
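
The summary does not specify the form of the phase-aware losses, so the following is only a generic illustration of why a loss computed on complex STFT values is phase-sensitive while a magnitude-only loss is not. The naive STFT, the two loss functions, and all parameter values are assumptions for this sketch, not the authors' formulation.

```python
import cmath
import math


def stft(x, n_fft=64, hop=32):
    """Naive Hann-windowed STFT returning frames of complex bins (0..n_fft/2)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def magnitude_loss(X, Y):
    """Mean absolute difference of magnitudes: ignores phase entirely."""
    diffs = [abs(abs(a) - abs(b)) for fx, fy in zip(X, Y) for a, b in zip(fx, fy)]
    return sum(diffs) / len(diffs)


def complex_loss(X, Y):
    """Mean distance between complex bins: penalizes phase errors too."""
    diffs = [abs(a - b) for fx, fy in zip(X, Y) for a, b in zip(fx, fy)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Reference tone and a polarity-inverted copy (a phase shift of pi).
    target = [math.sin(2 * math.pi * 4 * n / 64) for n in range(256)]
    estimate = [-s for s in target]
    T, E = stft(target), stft(estimate)
    print("magnitude-only loss:", magnitude_loss(T, E))
    print("complex-domain loss:", complex_loss(T, E))
```

The inverted copy has the same magnitude spectrum as the reference, so the magnitude-only loss cannot distinguish them, whereas the complex-domain loss is clearly nonzero.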
Authors

Yongyi Zang, Smule, Inc. (Computer Audition, Speech Processing, Music Information Retrieval, Music Composition)
Chris Manchester, Smule Labs
David Young, Smule Labs
Ivan Ivanov, Smule Labs
Jeffrey Lufkin, Smule Labs
Martin Vladimirov, Smule Labs
PJ Solomon, Smule Labs
Svetoslav Kepchelev, Smule Labs
Fei Yueh Chen, University of Rochester
Dongting Cai, University of California, San Diego
Teodor Naydenov, Smule Labs
Randal Leistikow, Smule Labs