🤖 AI Summary
To address high-frequency reconstruction distortion, poor cross-domain generalization, and inconsistent representations in cascaded models for speech super-resolution, this paper proposes an end-to-end 48 kHz waveform reconstruction framework. We design a unified Transformer-ConvNet generator that jointly leverages global contextual modeling and local detail recovery. Furthermore, we introduce a multi-band, multi-scale time-frequency discriminator and a multi-scale mel-spectrogram reconstruction loss to collaboratively enhance high-frequency fidelity and out-of-domain robustness. Experiments demonstrate that our method achieves state-of-the-art performance across objective metrics—including PESQ, STOI, and LSD—as well as in subjective ABX listening tests. Notably, it delivers significant quality improvements on both in-domain and out-of-domain speech, confirming its effectiveness and generalizability.
📝 Abstract
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations such as mel-spectrograms. However, existing SR methods, which typically rely on independently trained and concatenated networks, may produce inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent-space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss, in the adversarial training process. HiFi-SR is versatile, capable of upscaling input speech sampled at any rate between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
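To make the multi-scale mel-reconstruction loss concrete, here is a minimal NumPy sketch of the general idea: compare log-mel spectrograms of the predicted and target waveforms at several STFT resolutions and sum the L1 distances. This is an illustration, not the authors' implementation — the FFT sizes, hop lengths, and mel-band counts below are assumed for demonstration, and an actual training setup would compute this on differentiable tensors (e.g., in PyTorch).

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges equally spaced on the mel scale, mapped back to FFT bins.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed sliding FFT; shape (frames, bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_mel_loss(pred, target, sr=48000,
                        scales=((512, 128, 40), (1024, 256, 80), (2048, 512, 160))):
    """Sum of L1 log-mel distances over several time-frequency resolutions.

    `scales` holds (n_fft, hop, n_mels) triples; the values here are
    illustrative, not the paper's actual configuration.
    """
    loss = 0.0
    for n_fft, hop, n_mels in scales:
        fb = mel_filterbank(sr, n_fft, n_mels)
        log_mel_p = np.log(fb @ stft_mag(pred, n_fft, hop).T + 1e-5)
        log_mel_t = np.log(fb @ stft_mag(target, n_fft, hop).T + 1e-5)
        loss += np.mean(np.abs(log_mel_p - log_mel_t))
    return loss
```

Using multiple resolutions trades off time and frequency localization: small FFT windows catch transients while large windows resolve harmonic structure, so penalizing all scales jointly discourages both smearing and harmonic errors in the reconstructed high band.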