🤖 AI Summary
Audio super-resolution (SR) is often limited by insufficient exploitation of low-resolution (LR) signal priors, resulting in suboptimal upsampled quality. To address this, we propose latent bridge models (LBMs), which enable efficient LR-to-high-resolution (HR) mapping within a diffusion-based generative framework. LBMs combine three key components: compression of the waveform into a continuous latent space, a frequency-aware bridging mechanism, and a cascaded architecture with prior augmentation. Crucially, the method learns an any-to-any upsampling process during training and, uniquely, extends the upper bound of audio SR to 192 kHz, beyond the prior 48 kHz ceiling. On the VCTK, ESC-50, and Song-Describer benchmarks, LBMs achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR, and the 192 kHz upscaling is the first empirically validated demonstration at this sampling rate. Our core contributions are: (1) a frequency-aware latent-space bridging design, and (2) a scalable, general-purpose SR paradigm applicable to ultra-high-fidelity sampling rates.
📝 Abstract
Audio super-resolution (SR), i.e., upsampling a low-resolution (LR) waveform to its high-resolution (HR) version, has recently been explored with diffusion and bridge models, yet previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system based on latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance training despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequencies are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, making the first attempt to unlock audio upsampling beyond 48 kHz and enabling a seamless cascaded SR process, which provides greater flexibility for audio post-production. Comprehensive experiments on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that our system achieves state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, and sets the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.
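To make the task concrete, the sketch below shows the naive baseline that generative SR aims to beat: plain interpolation of an LR waveform up to an HR sampling rate. This is an illustration of the problem setup only, not the paper's method; the specific rates (12 kHz to 48 kHz) and the 440 Hz test tone are arbitrary choices for the example. Interpolation raises the sampling rate but adds no energy above the original Nyquist limit (here 6 kHz), and recovering that missing high-frequency content is precisely what a generative model such as an LBM must do.

```python
import numpy as np

# Naive upsampling baseline: 12 kHz LR waveform -> 48 kHz grid.
sr_lr, sr_hr = 12_000, 48_000

# One second of a 440 Hz test tone at the LR sampling rate.
t_lr = np.arange(sr_lr) / sr_lr
lr = np.sin(2 * np.pi * 440.0 * t_lr)

# Linear interpolation onto the 4x denser HR time grid. The result
# has 48k samples per second, but its spectrum still contains no
# content above the LR Nyquist frequency of 6 kHz -- the "missing
# band" a generative SR system is asked to synthesize.
t_hr = np.arange(sr_hr) / sr_hr
hr = np.interp(t_hr, t_lr, lr)

print(len(lr), len(hr))  # 12000 48000
```

Every fourth HR sample lands exactly on an LR grid point, so the baseline is consistent with the LR observation; it simply cannot invent the lost bandwidth, which motivates treating SR as conditional generation from an informative LR prior.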