AI Summary
Existing speech super-resolution (SSR) methods suffer from three key limitations: (1) representation mismatch induced by two-stage mel-spectrogram-vocoder pipelines; (2) blurring of high-frequency details and hallucination artifacts caused by CNN-based generators; and (3) high computational cost and poor cross-sample-rate and cross-domain robustness in diffusion/flow-based models. This paper proposes SwinSRGAN, an end-to-end high-fidelity SSR framework. Its generator is a Swin Transformer-based U-Net that captures long-range time-frequency dependencies while preserving transient components. A hybrid discriminator architecture combines time-domain multi-period and multi-scale discriminators (MPD/MSD) with a multi-band MDCT discriminator, alongside arcsinh spectral compression and sparsity regularization to enhance reconstruction fidelity. Experiments show reduced objective error on standard benchmarks, improved ABX preference scores, and better zero-shot cross-dataset generalization than NVSR and mdctGAN. The framework upsamples 16/24/32 kHz inputs to 48 kHz in a single pass and runs in real time.
Abstract
Speech super-resolution (SR) reconstructs high-frequency content from low-resolution speech signals. Existing systems often suffer from representation mismatch in two-stage mel-vocoder pipelines and from over-smoothing of hallucinated high-band content by CNN-only generators. Diffusion and flow models are computationally expensive, and their robustness across domains and sampling rates remains limited. We propose SwinSRGAN, an end-to-end framework operating on Modified Discrete Cosine Transform (MDCT) magnitudes. Its generator is a Swin Transformer-based U-Net that captures long-range spectro-temporal dependencies. A hybrid adversarial scheme combines time-domain MPD/MSD discriminators with a multi-band MDCT discriminator specialized for the high-frequency band. We employ a sparse-aware regularizer on arcsinh-compressed MDCT to better preserve transient components. The system upsamples inputs at various sampling rates to 48 kHz in a single pass and operates in real time. On standard benchmarks, SwinSRGAN reduces objective error and improves ABX preference scores. In zero-shot tests on HiFi-TTS without fine-tuning, it outperforms NVSR and mdctGAN, demonstrating strong generalization across datasets.
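To make the arcsinh compression and sparse-aware regularizer more concrete, here is a minimal sketch in plain Python. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function names, the `scale` and `l1_weight` parameters, and the choice of an L1 reconstruction term plus an L1 sparsity penalty on the compressed coefficients. It is not the authors' loss.

```python
import math

def arcsinh_compress(coeffs, scale=1.0):
    """Compress signed MDCT-like coefficients with arcsinh.

    arcsinh is roughly linear near zero and logarithmic for large
    magnitudes, and it keeps the sign of each coefficient.
    `scale` is a hypothetical tuning parameter.
    """
    return [math.asinh(scale * c) for c in coeffs]

def arcsinh_expand(compressed, scale=1.0):
    """Invert arcsinh_compress (sinh is the exact inverse of arcsinh)."""
    return [math.sinh(z) / scale for z in compressed]

def sparse_aware_loss(pred, target, l1_weight=0.01, scale=1.0):
    """Assumed loss: L1 distance in the compressed domain plus an L1
    sparsity penalty on the compressed prediction."""
    pred_c = arcsinh_compress(pred, scale)
    target_c = arcsinh_compress(target, scale)
    n = len(pred_c)
    recon = sum(abs(p - t) for p, t in zip(pred_c, target_c)) / n
    sparsity = sum(abs(p) for p in pred_c) / n
    return recon + l1_weight * sparsity

if __name__ == "__main__":
    # Toy coefficients spanning several orders of magnitude.
    target = [0.0, 0.01, -0.5, 20.0, -300.0]
    pred = [0.0, 0.02, -0.4, 18.0, -250.0]
    print(arcsinh_compress(target))
    print(sparse_aware_loss(pred, target))
```

The compression narrows the dynamic range, so small high-band coefficients are not swamped by large low-band ones in the reconstruction term. The L1 penalty is one common way to favor sparse outputs. In a real training loop these operations would run on tensors in an autodiff framework rather than on Python lists.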