🤖 AI Summary
This work addresses the longstanding challenge in general-purpose speech enhancement of achieving robustness and high perceptual quality simultaneously. To this end, we propose a generative-predictive fusion framework that performs full-stack speech restoration in the self-supervised representation domain and enhancement in the spectrogram domain, followed by a bandwidth-extension post-processing module that fuses the outputs of both branches and upsamples the signal to 48 kHz. To our knowledge, this is the first approach to jointly integrate generative and predictive enhancement pathways, leveraging neural vocoders, self-supervised representation learning, and bandwidth extension. Evaluated on the ICASSP 2026 URGENT Challenge Track 1 blind test, the proposed method achieves state-of-the-art performance on both objective and subjective metrics, significantly outperforming existing approaches.
📝 Abstract
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system combines a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform via a neural vocoder, with a predictive branch that performs spectrogram-domain enhancement and provides complementary cues. Outputs from both branches are fused by a post-processing module, which also performs bandwidth extension to produce the enhanced waveform at 48 kHz; the result is then downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
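The two-branch pipeline described above can be sketched as follows. This is a minimal, illustrative skeleton, not the authors' implementation: the branch functions are stand-ins for the SSL-domain restoration/vocoder path and the spectrogram-domain enhancer, and the learned fusion/bandwidth-extension module is replaced by a fixed blend plus linear-interpolation upsampling. All function names and parameters are hypothetical.

```python
import numpy as np

def generative_branch(wav: np.ndarray) -> np.ndarray:
    """Stand-in for SSL-domain restoration + neural vocoder resynthesis.
    Here: a light moving-average smoothing, purely illustrative."""
    kernel = np.ones(3) / 3.0
    return np.convolve(wav, kernel, mode="same")

def predictive_branch(wav: np.ndarray) -> np.ndarray:
    """Stand-in for spectrogram-domain enhancement (e.g., mask estimation).
    Here: crude magnitude clipping in the frequency domain, illustrative only."""
    spec = np.fft.rfft(wav)
    mag, phase = np.abs(spec), np.angle(spec)
    mag = np.minimum(mag, np.percentile(mag, 99))  # toy "denoising" step
    return np.fft.irfft(mag * np.exp(1j * phase), n=len(wav))

def fuse_and_extend(gen: np.ndarray, pred: np.ndarray,
                    sr_in: int, sr_out: int = 48_000,
                    alpha: float = 0.5) -> np.ndarray:
    """Blend the two branch outputs, then upsample to sr_out.
    The fixed blend weight and linear interpolation are placeholders for
    the learned fusion / bandwidth-extension post-processing module."""
    fused = alpha * gen + (1.0 - alpha) * pred
    n_out = int(round(len(fused) * sr_out / sr_in))
    t_in = np.arange(len(fused)) / sr_in
    t_out = np.arange(n_out) / sr_out
    return np.interp(t_out, t_in, fused)

# Toy usage: 1 s of a noisy 440 Hz tone at 16 kHz.
sr = 16_000
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 440 * np.arange(sr) / sr) + 0.1 * rng.standard_normal(sr)

enhanced_48k = fuse_and_extend(generative_branch(noisy), predictive_branch(noisy), sr)
enhanced = enhanced_48k[:: 48_000 // sr]  # back to the original sampling rate
```

In the actual system the fusion and bandwidth extension are learned jointly; the sketch only makes the data flow concrete: two enhancement paths on the same input, a fusion step, upsampling to 48 kHz, and a final downsample to the source rate.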