Inference-time Scaling for Diffusion-based Audio Super-resolution

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models for audio super-resolution suffer from sampling stochasticity, resulting in unstable output quality and high variance. This paper introduces an *inference-time scaling* paradigm that combines multi-trajectory random search with zero-order optimization, both guided by task-specific verifiers, enabling efficient exploration of the high-dimensional solution space rather than simply increasing the number of sampling steps. The method applies uniformly to speech, music, and sound-effect super-resolution. On 4–24 kHz speech super-resolution it achieves improvements of up to +9.70% in aesthetic quality, +5.88% in speaker similarity, −15.20% in word error rate, and −46.98% in spectral distance.

📝 Abstract
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance, quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm: inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Task-specific verifiers are developed, and two search algorithms, random search and zero-order search, are introduced for SR. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: https://racerk.github.io/tt-scale-audiosr/.
Problem

Research questions and friction points this paper is trying to address.

Improving audio super-resolution quality with diffusion models
Reducing output variance in diffusion-based sampling processes
Enhancing performance across diverse audio domains and frequencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time scaling for audio super-resolution
Verifier-algorithm guided solution space exploration
Multiple solution trajectories during sampling process
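The verifier-guided exploration above can be sketched as follows. This is a minimal, hypothetical illustration and not the paper's implementation: `diffusion_sr` and `verifier` are toy placeholders standing in for a full diffusion sampling chain and a task-specific quality model, and the two loops show only the control flow of random search and zero-order search over initial noise.

```python
import numpy as np

def diffusion_sr(noise):
    """Stand-in for a diffusion SR sampler: maps an initial noise
    tensor to a super-resolved output. (Hypothetical placeholder;
    the real system runs a full diffusion sampling chain here.)"""
    return np.tanh(noise)

def verifier(output):
    """Stand-in task-specific verifier returning a scalar quality
    score, higher is better. A real verifier could combine, e.g.,
    spectral distance, speaker similarity, or aesthetic scores."""
    return -float(np.mean(output ** 2))  # toy score

def random_search(n_trajectories=8, shape=(16,), seed=0):
    """Random search: sample several initial noises, run the sampler
    on each trajectory, keep the output the verifier scores highest."""
    rng = np.random.default_rng(seed)
    best_out, best_score = None, -np.inf
    for _ in range(n_trajectories):
        noise = rng.standard_normal(shape)
        out = diffusion_sr(noise)
        score = verifier(out)
        if score > best_score:
            best_out, best_score = out, score
    return best_out, best_score

def zero_order_search(init_noise, iters=5, n_neighbors=4, step=0.1, seed=1):
    """Zero-order (derivative-free) search: perturb the current best
    noise and move to a neighbor whenever the verifier prefers it."""
    rng = np.random.default_rng(seed)
    best, best_score = init_noise, verifier(diffusion_sr(init_noise))
    for _ in range(iters):
        for _ in range(n_neighbors):
            cand = best + step * rng.standard_normal(best.shape)
            score = verifier(diffusion_sr(cand))
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

Both routines spend extra inference compute on search rather than on longer sampling chains; the verifier's score, not the step count, decides which trajectory is kept.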
Yizhu Jin
The Hong Kong University of Science and Technology
Zhen Ye
The Hong Kong University of Science and Technology
Zeyue Tian
The Hong Kong University of Science and Technology
Music Generation · Generative AI · Multi-Modal Learning
Haohe Liu
Research Scientist at Meta AI
Audio Generation · Audio Classification · Speech Quality Enhancement · Music Source Separation
Qiuqiang Kong
The Chinese University of Hong Kong
Audio Processing · Artificial Intelligence
Yike Guo
The Hong Kong University of Science and Technology
Wei Xue
The Hong Kong University of Science and Technology