SwinSRGAN: Swin Transformer-based Generative Adversarial Network for High-Fidelity Speech Super-Resolution

📅 2025-09-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing speech super-resolution (SSR) methods suffer from three key limitations: (1) representation mismatch induced by two-stage mel-spectrogram–vocoder pipelines; (2) over-smoothed, hallucinated high-frequency content produced by CNN-only generators; and (3) high computational cost and limited robustness across sampling rates and domains in diffusion- and flow-based models. This paper proposes SwinSRGAN, an end-to-end high-fidelity SSR framework operating on MDCT magnitudes. Its generator is a Swin Transformer–based U-Net designed to capture long-range time-frequency dependencies while preserving transient components. A hybrid adversarial scheme combines time-domain multi-period and multi-scale discriminators (MPD/MSD) with a multi-band MDCT discriminator, and a sparse-aware regularizer on arcsinh-compressed MDCT coefficients improves reconstruction fidelity. On standard benchmarks the model reduces objective error and improves ABX preference scores, and in zero-shot tests on HiFi-TTS it outperforms NVSR and mdctGAN. The framework upsamples 16, 24, or 32 kHz input to 48 kHz in a single pass and runs in real time.
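To make the 16/24/32 kHz to 48 kHz task concrete, the sketch below computes which frequency bins of a 48 kHz representation lie above the input's Nyquist frequency and therefore have to be generated rather than copied. The function name `missing_high_band` and the 512-bin grid are assumptions chosen for illustration; the summary does not state the frame size the paper uses.

```python
def missing_high_band(input_rate_hz, target_rate_hz=48_000, num_bins=512):
    """Return (first_missing_bin, missing_bin_count) on the target-rate grid.

    Bin k covers [k, k + 1) * (target_rate_hz / 2) / num_bins Hz, so every bin
    whose lower edge is at or above the input Nyquist frequency carries no
    information in the low-resolution input.
    num_bins=512 is a hypothetical value, not taken from the paper.
    """
    if not 0 < input_rate_hz <= target_rate_hz:
        raise ValueError("input rate must be positive and not exceed the target rate")
    # Integer ceiling division: smallest k with k >= input_rate * num_bins / target_rate.
    first_missing = -(-input_rate_hz * num_bins // target_rate_hz)
    return first_missing, num_bins - first_missing


for rate in (16_000, 24_000, 32_000):
    first, count = missing_high_band(rate)
    print(f"{rate} Hz input: bins {first}..511 must be reconstructed ({count} bins)")
```

Under these assumptions a 16 kHz input leaves roughly two thirds of the bins empty, while a 32 kHz input leaves about one third, which is why a single model covering all three rates has to handle very different amounts of missing content.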

📝 Abstract
Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. Its generator is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies, and a hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
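As a rough illustration of the representation the abstract describes, the sketch below computes the MDCT of a single frame and applies an arcsinh compression to the coefficients. It is a minimal, unoptimized example under stated assumptions: the sine window, the `scale` parameter, and the exact form `asinh(scale * c)` are illustrative choices, not details confirmed by the abstract.

```python
import math


def mdct_frame(frame):
    """Direct O(N^2) MDCT of one frame of 2N samples, returning N coefficients."""
    two_n = len(frame)
    if two_n == 0 or two_n % 2:
        raise ValueError("frame length must be a positive even number")
    n_half = two_n // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n in range(two_n):
            # Sine analysis window (an assumed, commonly used choice).
            window = math.sin(math.pi * (n + 0.5) / two_n)
            phase = math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            acc += window * frame[n] * math.cos(phase)
        coeffs.append(acc)
    return coeffs


def arcsinh_compress(coeffs, scale=10.0):
    """Sign-preserving compression: near-linear for small values, log-like for large ones."""
    return [math.asinh(scale * c) for c in coeffs]


def arcsinh_expand(compressed, scale=10.0):
    """Exact inverse of arcsinh_compress."""
    return [math.sinh(y) / scale for y in compressed]


# A 16-sample frame containing a tone aligned with bin 3 of an 8-bin MDCT.
frame = [math.cos(math.pi / 8 * (n + 0.5 + 4) * 3.5) for n in range(16)]
coeffs = mdct_frame(frame)
compressed = arcsinh_compress(coeffs)
```

Unlike a logarithm, arcsinh is defined at zero and keeps the sign of each coefficient, which matters because MDCT coefficients are real-valued and signed. A practical implementation would use an FFT-based MDCT over overlapping frames rather than this direct double loop.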
Problem

Research questions and friction points this paper is trying to address.

Reconstructs high-frequency content from low-resolution speech signals
Addresses over-smoothing and representation mismatch in existing systems
Reduces computational cost while maintaining cross-domain robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Swin Transformer-based U-Net for spectro-temporal dependencies
Hybrid adversarial scheme with multi-band MDCT discriminator
Sparse-aware regularizer on arcsinh-compressed MDCT
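The page does not give the form of the sparse-aware regularizer, so the following is only one plausible reading, sketched for intuition: an L1 distance between arcsinh-compressed coefficients, plus a penalty on energy the prediction places in bins where the target is silent. The function `sparse_aware_loss` and the parameters `scale`, `weight`, and `silence_eps` are all hypothetical and should not be read as the paper's loss.

```python
import math


def compress(coeffs, scale=10.0):
    """Arcsinh compression of signed coefficients (assumed form)."""
    return [math.asinh(scale * c) for c in coeffs]


def sparse_aware_loss(pred, target, scale=10.0, weight=0.1, silence_eps=1e-3):
    """Hypothetical loss: compressed-domain L1 plus a penalty on leaked energy.

    recon: mean absolute error between compressed prediction and target.
    leak:  mean compressed magnitude of the prediction over bins where the
           compressed target is (near) zero, discouraging spurious content.
    """
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and equally long")
    p, t = compress(pred, scale), compress(target, scale)
    recon = sum(abs(a - b) for a, b in zip(p, t)) / len(t)
    silent = [abs(a) for a, b in zip(p, t) if abs(b) < silence_eps]
    leak = sum(silent) / len(silent) if silent else 0.0
    return recon + weight * leak


target = [0.0, 0.5, 0.0, -0.2]
leaky = [0.1, 0.5, 0.0, -0.2]   # 0.1 of spurious energy in a silent bin
offset = [0.0, 0.6, 0.0, -0.2]  # the same 0.1 error on an active bin
loss_leaky = sparse_aware_loss(leaky, target)
loss_offset = sparse_aware_loss(offset, target)
```

With these toy values the error in the silent bin is penalized more heavily than the same-sized error on an already active coefficient, which is the behavior one would want when most coefficients are near zero and only a few, such as transients, are large.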
Jiajun Yuan
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
Xiaochen Wang
National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
Yuhang Xiao
Shenzhen University
Yulin Wu
School of Artificial Intelligence, Jianghan University, Wuhan, China
Chenhao Hu
Department of Psychology, Tsinghua University
environmental psychology, intervention studies, health psychology, social psychology
Xueyang Lv
Xiaomi Corporation, Beijing, China