HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

📅 2025-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high-frequency reconstruction distortion, poor cross-domain generalization, and the inconsistent representations of cascaded speech super-resolution models, this paper proposes an end-to-end 48 kHz waveform reconstruction framework. We design a unified transformer-convolutional generator that jointly leverages global contextual modeling and local detail recovery, and we introduce a multi-band, multi-scale time-frequency discriminator together with a multi-scale mel-spectrogram reconstruction loss to enhance high-frequency fidelity and out-of-domain robustness in tandem. Experiments demonstrate state-of-the-art performance across objective metrics, including PESQ, STOI, and LSD, as well as in subjective ABX preference tests. Notably, the method delivers significant quality improvements on both in-domain and out-of-domain speech, confirming its effectiveness and generalizability.

📝 Abstract
The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss in the adversarial training process. HiFi-SR is versatile, capable of upscaling any input speech signal between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
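The multi-scale mel-reconstruction loss mentioned in the abstract can be sketched in simplified form. The snippet below assumes mel-spectrograms have already been computed at several STFT resolutions and averages an L1 distance across scales; the actual STFT parameters, mel configurations, and any per-scale weighting used by HiFi-SR are not specified here and are assumptions.

```python
def multiscale_mel_loss(pred_mels, target_mels):
    """Average L1 distance between predicted and target mel-spectrograms
    computed at several time-frequency resolutions.

    pred_mels / target_mels: lists of same-shaped 2-D matrices
    (nested lists of floats), one matrix per scale. This is a
    simplified stand-in for the paper's loss, not its exact form.
    """
    per_scale = []
    for pred, target in zip(pred_mels, target_mels):
        total, count = 0.0, 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                total += abs(p - t)  # element-wise L1 error
                count += 1
        per_scale.append(total / count)  # mean error at this scale
    # average across scales so no single resolution dominates
    return sum(per_scale) / len(per_scale)
```

In practice each scale would come from an STFT with a different window size, so the loss penalizes both coarse spectral envelope errors and fine high-frequency detail.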
Problem

Research questions and friction points this paper is trying to address.

Speech Clarity
Audio Naturalness
High-frequency Sounds
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer
Convolutional GAN
Frequency Discriminator
Shengkui Zhao
Senior Algorithm Expert, Alibaba Group
Speech processing and large models
Kun Zhou
Tongyi Lab, Alibaba Group, Singapore
Zexu Pan
Tongyi Lab, Alibaba Group, Singapore
Yukun Ma
Alibaba Group
ASR, SLU
Chong Zhang
Tongyi Lab, Alibaba Group, Singapore
Bin Ma
Tongyi Lab, Alibaba Group, Singapore