Inference-time Scaling for Diffusion-based Audio Super-resolution

📅 2025-08-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models for audio super-resolution suffer from sampling stochasticity, resulting in unstable output quality and high variance. This paper introduces an *inference-time scaling* paradigm that combines multi-trajectory random search with zero-order optimization, both guided by task-specific verifiers, enabling efficient exploration of the high-dimensional solution space rather than simply increasing the number of sampling steps. The method applies uniformly to speech, music, and sound-effect super-resolution. On 4–24 kHz speech super-resolution it achieves improvements of up to +9.70% in aesthetic quality, +5.88% in speaker similarity, −15.20% in word error rate, and −46.98% in spectral distance.

📝 Abstract
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance, quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm: inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Task-specific verifiers are developed, and two search algorithms, random search and zero-order search, are introduced for SR. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: https://racerk.github.io/tt-scale-audiosr/.
Problem

Research questions and friction points this paper is trying to address.

Improving audio super-resolution quality with diffusion models
Reducing output variance in diffusion-based sampling processes
Enhancing performance across diverse audio domains and frequencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time scaling for audio super-resolution
Verifier-algorithm guided solution space exploration
Multiple solution trajectories during sampling process
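The verifier-guided exploration above can be sketched as follows. This is a minimal, hypothetical illustration and not the paper's implementation: `diffusion_sr` and `verifier` are toy placeholders standing in for a full diffusion sampling chain and a task-specific quality model, and the two loops show only the control flow of random search and zero-order search over initial noise.

```python
import numpy as np

def diffusion_sr(noise):
    """Stand-in for a diffusion SR sampler: maps an initial noise
    tensor to a super-resolved output. (Hypothetical placeholder;
    the real system runs a full diffusion sampling chain here.)"""
    return np.tanh(noise)

def verifier(output):
    """Stand-in task-specific verifier returning a scalar quality
    score, higher is better. A real verifier could combine, e.g.,
    spectral distance, speaker similarity, or aesthetic scores."""
    return -float(np.mean(output ** 2))  # toy score

def random_search(n_trajectories=8, shape=(16,), seed=0):
    """Random search: sample several initial noises, run the sampler
    on each trajectory, keep the output the verifier scores highest."""
    rng = np.random.default_rng(seed)
    best_out, best_score = None, -np.inf
    for _ in range(n_trajectories):
        noise = rng.standard_normal(shape)
        out = diffusion_sr(noise)
        score = verifier(out)
        if score > best_score:
            best_out, best_score = out, score
    return best_out, best_score

def zero_order_search(init_noise, iters=5, n_neighbors=4, step=0.1, seed=1):
    """Zero-order (derivative-free) search: perturb the current best
    noise and move to a neighbor whenever the verifier prefers it."""
    rng = np.random.default_rng(seed)
    best, best_score = init_noise, verifier(diffusion_sr(init_noise))
    for _ in range(iters):
        for _ in range(n_neighbors):
            cand = best + step * rng.standard_normal(best.shape)
            score = verifier(diffusion_sr(cand))
            if score > best_score:
                best, best_score = cand, score
    return best, best_score
```

Both routines spend extra inference compute on search rather than on longer sampling chains; the verifier's score, not the step count, decides which trajectory is kept.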
Yizhu Jin
The Hong Kong University of Science and Technology
Zhen Ye
The Hong Kong University of Science and Technology
Zeyue Tian
The Hong Kong University of Science and Technology
Music Generation · Generative AI · Multi-Modal Learning
Haohe Liu
Research Scientist at Meta AI
Audio Generation · Audio Classification · Speech Quality Enhancement · Music Source Separation
Qiuqiang Kong
The Chinese University of Hong Kong
Audio Processing · Artificial Intelligence
Yike Guo
The Hong Kong University of Science and Technology
Wei Xue
The Hong Kong University of Science and Technology