FlashSR: One-step Versatile Audio Super-resolution via Diffusion Distillation

📅 2025-01-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses versatile audio super-resolution (SR): restoring the high-frequency content of low-sample-rate audio (4–32 kHz) from diverse domains (speech, music, sound effects) to produce 48 kHz output. It proposes FlashSR, a single-step diffusion model obtained through diffusion distillation. Training combines three objectives: a distillation loss, an adversarial loss, and a distribution-matching distillation loss. The authors also introduce the SR Vocoder, designed specifically for SR models that operate on mel-spectrograms. In both objective and subjective evaluations, FlashSR performs competitively with the current state-of-the-art model while running approximately 22 times faster.
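To make the task concrete, the following minimal sketch uses the Nyquist limit to compute which frequency band is absent from an input at a given sample rate and therefore has to be generated to reach 48 kHz. The function name `missing_band_hz` and the constants are illustrative assumptions, not names from the paper.

```python
# Sample-rate bounds taken from the task description (4-32 kHz in, 48 kHz out).
TARGET_SR_HZ = 48_000
MIN_INPUT_SR_HZ = 4_000
MAX_INPUT_SR_HZ = 32_000


def missing_band_hz(input_sr_hz, target_sr_hz=TARGET_SR_HZ):
    """Return (low, high) edges in Hz of the band an SR model must generate.

    A signal sampled at `input_sr_hz` carries content only up to its Nyquist
    frequency, input_sr_hz / 2. The target signal carries content up to
    target_sr_hz / 2, so the band in between has to be synthesized.
    """
    if not MIN_INPUT_SR_HZ <= input_sr_hz <= MAX_INPUT_SR_HZ:
        raise ValueError("input sample rate outside the 4-32 kHz range")
    return input_sr_hz / 2, target_sr_hz / 2


if __name__ == "__main__":
    for sr in (4_000, 8_000, 16_000, 32_000):
        low, high = missing_band_hz(sr)
        print(f"{sr} Hz input: generate {low:.0f} Hz to {high:.0f} Hz")
```

For a 4 kHz input, everything above 2 kHz is missing, which is why the low end of the input range is the hardest case.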

📝 Abstract

Versatile audio super-resolution (SR) is the challenging task of restoring high-frequency components from low-resolution audio with sampling rates between 4kHz and 32kHz in various domains such as music, speech, and sound effects. Previous diffusion-based SR methods suffer from slow inference due to the need for a large number of sampling steps. In this paper, we introduce FlashSR, a single-step diffusion model for versatile audio super-resolution aimed at producing 48kHz audio. FlashSR achieves fast inference by utilizing diffusion distillation with three objectives: distillation loss, adversarial loss, and distribution-matching distillation loss. We further enhance performance by proposing the SR Vocoder, which is specifically designed for SR models operating on mel-spectrograms. FlashSR demonstrates competitive performance with the current state-of-the-art model in both objective and subjective evaluations while being approximately 22 times faster.
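The abstract names three training objectives but does not say how they are combined. A common choice is a weighted sum, sketched below under that assumption. The weights, the `LossWeights` class, and the `total_generator_loss` function are hypothetical and do not come from the paper. Each loss term is passed in as an already computed scalar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the abstract gives no values for these.
    distillation: float = 1.0
    adversarial: float = 1.0
    dmd: float = 1.0


def total_generator_loss(distillation_loss, adversarial_loss, dmd_loss,
                         weights=LossWeights()):
    """Combine the three objectives as a weighted sum (an assumed form)."""
    return (weights.distillation * distillation_loss
            + weights.adversarial * adversarial_loss
            + weights.dmd * dmd_loss)


if __name__ == "__main__":
    # Toy scalar values standing in for losses computed on one batch.
    print(total_generator_loss(0.5, 0.25, 0.25))
```

In a real training loop the three terms would be tensors from a deep learning framework, and the weights would be tuned hyperparameters.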

Problem

Research questions and friction points this paper is trying to address.

Audio Super-Resolution

Low-Quality Audio

High-Frequency Reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

FlashSR

Diffusion Distillation

SR Vocoder Optimization

🔎 Similar Papers

High-Resolution Speech Restoration with Latent Diffusion Model

2024-09-17 | arXiv.org | Citations: 0
