SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution

📅 2025-09-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech super-resolution (SSR) methods suffer from three key limitations: (1) representation mismatch induced by two-stage mel-spectrogram–vocoder pipelines; (2) over-smoothed high-frequency detail and hallucination artifacts from CNN-only generators; and (3) high computational cost and limited robustness across sampling rates and domains in diffusion- and flow-based models. This paper proposes SwinSRGAN, an end-to-end SSR framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. A Swin Transformer-based U-Net generator captures long-range time-frequency dependencies. A hybrid adversarial scheme combines time-domain multi-period and multi-scale discriminators (MPD/MSD) with a multi-band MDCT discriminator specialized for the high-frequency band, and a sparse-aware regularizer on arcsinh-compressed MDCT helps preserve transient components. On standard benchmarks, the model reduces objective error and improves ABX preference scores; in zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN. The framework upsamples 16/24/32 kHz inputs to 48 kHz in a single pass and runs in real time.
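To make the sampling-rate arithmetic concrete, the sketch below computes which MDCT bins of a 48 kHz frame lie above the Nyquist frequency of a lower-rate input, that is, the band the model has to reconstruct. The function name `first_missing_bin` and the bin count of 240 are illustrative assumptions, not values taken from the paper.

```python
def first_missing_bin(input_rate_hz: int, n_bins: int,
                      target_rate_hz: int = 48_000) -> int:
    """Index of the first MDCT bin lying entirely above the input's Nyquist.

    Assumes n_bins bins evenly spanning 0 .. target_rate_hz / 2, so bin k
    covers roughly [k, k + 1) * target_rate_hz / (2 * n_bins) Hz.
    """
    # Integer ceiling of input_rate_hz * n_bins / target_rate_hz.
    return -(-(input_rate_hz * n_bins) // target_rate_hz)


# Hypothetical frame with 240 bins (100 Hz per bin at 48 kHz).
for rate in (16_000, 24_000, 32_000):
    start = first_missing_bin(rate, n_bins=240)
    print(f"{rate} Hz input: bins {start}..239 must be generated")
```

Under these assumptions, a 16 kHz input leaves the upper two thirds of the bins empty, which gives a sense of how much of the spectrum a high-band discriminator would be judging.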

📝 Abstract

Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. Its generator is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies, and a hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
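The abstract does not give the exact form of the regularizer, so the following is only a rough NumPy sketch of the general idea: compute MDCT coefficients, compress them with arcsinh, and add an L1 penalty that favors sparse predictions. The function names (`mdct_frames`, `arcsinh_compress`, `sparse_aware_loss`), the sine window, the `scale` and `lam` parameters, and the L1 form of the penalty are all assumptions made for illustration; the sketch also works on signed coefficients, whereas the paper refers to MDCT magnitudes.

```python
import numpy as np


def mdct_frames(signal: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Direct O(N^2) MDCT with a sine window and 50% overlap.

    Returns an array of shape (n_frames, n_bins).
    """
    frame_len = 2 * n_bins
    # Zero-pad so the signal splits into whole hops, plus one hop each side.
    pad = (-len(signal)) % n_bins
    x = np.concatenate([np.zeros(n_bins), signal, np.zeros(pad + n_bins)])
    n_frames = len(x) // n_bins - 1

    n = np.arange(frame_len)
    k = np.arange(n_bins)
    window = np.sin(np.pi * (n + 0.5) / frame_len)
    # Standard MDCT basis: cos[pi/N * (n + 1/2 + N/2) * (k + 1/2)].
    basis = np.cos(np.pi / n_bins * np.outer(n + 0.5 + n_bins / 2, k + 0.5))

    frames = np.stack([x[i * n_bins: i * n_bins + frame_len]
                       for i in range(n_frames)])
    return (frames * window) @ basis


def arcsinh_compress(coeffs: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Sign-preserving compression: near-linear at 0, logarithmic for large values."""
    return np.arcsinh(scale * coeffs)


def sparse_aware_loss(pred: np.ndarray, target: np.ndarray,
                      lam: float = 0.1) -> float:
    """L1 reconstruction error in the compressed domain plus an L1 sparsity term.

    The sparsity term is an assumed stand-in for the paper's regularizer.
    """
    cp, ct = arcsinh_compress(pred), arcsinh_compress(target)
    recon = np.mean(np.abs(cp - ct))
    sparsity = np.mean(np.abs(cp))
    return float(recon + lam * sparsity)
```

Because arcsinh behaves like a logarithm for large inputs but stays linear and sign-preserving near zero, it reduces the dynamic range of the coefficients without the singularity a plain log would have at zero.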

Problem

Research questions and friction points this paper is trying to address.

Reconstructs high-frequency content from low-resolution speech signals

Addresses over-smoothing and representation mismatch in existing systems

Reduces computational cost while maintaining cross-domain robustness

Innovation

Methods, ideas, or system contributions that make the work stand out.

Swin Transformer-based U-Net for spectro-temporal dependencies

Hybrid adversarial scheme with multi-band MDCT discriminator

Sparse-aware regularizer on arcsinh-compressed MDCT
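As background for the first item, Swin Transformer blocks compute attention inside non-overlapping local windows and alternate with cyclically shifted windows so that information can cross window borders. The sketch below shows only that partitioning step on a 2-D time-frequency map, using NumPy; the function names and the toy 4×4 input are illustrative, and nothing here reflects the specific window sizes or layer layout used in SwinSRGAN.

```python
import numpy as np


def window_partition(x: np.ndarray, m: int) -> np.ndarray:
    """Split an (H, W) map into non-overlapping m x m windows.

    H and W must be divisible by m. Returns shape (H//m * W//m, m, m).
    """
    h, w = x.shape
    return (x.reshape(h // m, m, w // m, m)
             .transpose(0, 2, 1, 3)
             .reshape(-1, m, m))


def cyclic_shift(x: np.ndarray, m: int) -> np.ndarray:
    """Roll the map by half a window along both axes (shifted-window step)."""
    return np.roll(x, shift=(-(m // 2), -(m // 2)), axis=(0, 1))


# Toy map: rows could be MDCT frames, columns frequency bins.
spec = np.arange(16).reshape(4, 4)
regular = window_partition(spec, 2)                   # windows for one block
shifted = window_partition(cyclic_shift(spec, 2), 2)  # windows for the next block
```

Alternating the two partitions keeps attention cost local while still letting distant time-frequency regions interact over several layers.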

🔎 Similar Papers

Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution