🤖 AI Summary
This paper addresses versatile audio super-resolution (SR): restoring the high-frequency content of low-sample-rate audio (4–32 kHz) from diverse domains (speech, music, sound effects) to produce 48 kHz output. It proposes FlashSR, a single-step diffusion model that avoids the large number of sampling steps that make earlier diffusion-based SR methods slow. FlashSR is trained by diffusion distillation with three objectives: a distillation loss, an adversarial loss, and a distribution-matching distillation loss. The paper also proposes the SR Vocoder, designed specifically for SR models that operate on mel-spectrograms. In both objective and subjective evaluations, FlashSR performs competitively with the current state-of-the-art model while running approximately 22 times faster.
📝 Abstract
Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4 kHz and 32 kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48 kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.
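To make the task concrete, the sketch below works out which frequency band an SR model would have to restore for a given input sampling rate. It relies only on the Nyquist limit (a signal sampled at rate fs carries content up to fs/2) and the 48 kHz target stated in the abstract. The function names and the chosen example rates are illustrative assumptions, not part of FlashSR.

```python
# Illustrative sketch only: the names below are hypothetical and do not come from the paper.
TARGET_SR_HZ = 48_000  # output sampling rate stated in the abstract


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2


def missing_band_hz(input_sr_hz: int, target_sr_hz: int = TARGET_SR_HZ) -> tuple[float, float]:
    """Return (low, high) edges of the band absent from the input but present at the target rate."""
    if not 0 < input_sr_hz <= target_sr_hz:
        raise ValueError("input rate must be positive and no higher than the target rate")
    return nyquist_hz(input_sr_hz), nyquist_hz(target_sr_hz)


if __name__ == "__main__":
    # Example input rates chosen from the 4 kHz to 32 kHz range named in the abstract.
    for sr in (4_000, 16_000, 32_000):
        low, high = missing_band_hz(sr)
        print(f"{sr} Hz input: restore {low:.0f} Hz to {high:.0f} Hz")
```

For a 4 kHz input the band to restore spans 2 kHz to 24 kHz, while for a 32 kHz input it spans only 16 kHz to 24 kHz. This spread in how much content has to be generated is one reason a single model covering the whole input range is described as versatile.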