🤖 AI Summary
To address high-frequency reconstruction distortion, poor cross-domain generalization, and inconsistent representations in cascaded models for speech super-resolution, this paper proposes an end-to-end 48 kHz waveform reconstruction framework. We design a unified Transformer-ConvNet generator that jointly leverages global contextual modeling and local detail recovery. Furthermore, we introduce a multi-band, multi-scale time-frequency discriminator and a multi-scale mel-spectrogram reconstruction loss to collaboratively enhance high-frequency fidelity and out-of-domain robustness. Experiments demonstrate that our method achieves state-of-the-art performance across objective metrics—including PESQ, STOI, and LSD—as well as in subjective ABX listening tests. Notably, it delivers significant quality improvements on both in-domain and out-of-domain speech, confirming its effectiveness and generalizability.
📝 Abstract
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations such as mel-spectrograms. However, existing SR methods, which typically rely on independently trained and concatenated networks, may produce inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent-space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss, in the adversarial training process. HiFi-SR is versatile, capable of upscaling input speech sampled at any rate between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
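To make the multi-scale mel-reconstruction loss concrete, here is a minimal NumPy sketch of the general idea: compare log-mel spectrograms of the predicted and target waveforms at several STFT resolutions and sum the L1 distances. This is an illustration, not the authors' implementation — the FFT sizes, hop lengths, and mel-band counts below are assumed for demonstration, and an actual training setup would compute this on differentiable tensors (e.g., in PyTorch).

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges equally spaced on the mel scale, mapped back to FFT bins.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via a Hann-windowed sliding FFT; shape (frames, bins)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multiscale_mel_loss(pred, target, sr=48000,
                        scales=((512, 128, 40), (1024, 256, 80), (2048, 512, 160))):
    """Sum of L1 log-mel distances over several time-frequency resolutions.

    `scales` holds (n_fft, hop, n_mels) triples; the values here are
    illustrative, not the paper's actual configuration.
    """
    loss = 0.0
    for n_fft, hop, n_mels in scales:
        fb = mel_filterbank(sr, n_fft, n_mels)
        log_mel_p = np.log(fb @ stft_mag(pred, n_fft, hop).T + 1e-5)
        log_mel_t = np.log(fb @ stft_mag(target, n_fft, hop).T + 1e-5)
        loss += np.mean(np.abs(log_mel_p - log_mel_t))
    return loss
```

Using multiple resolutions trades off time and frequency localization: small FFT windows catch transients while large windows resolve harmonic structure, so penalizing all scales jointly discourages both smearing and harmonic errors in the reconstructed high band.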