🤖 AI Summary
Traditional speech enhancement methods struggle to remove multiple distortions jointly while reconstructing phones and high-frequency harmonics, which leads to breathing and gasping artifacts; they are also computationally demanding and often limited to wide-band output. This paper introduces Hi-ResLDM, a latent diffusion model (LDM) designed to restore speech to studio quality at 48 kHz. The approach constructs a joint time-frequency latent space that enables end-to-end full-band reconstruction, including ultra-high-frequency components. Conditional denoising training supports unified modeling of diverse distortions, and explicit suppression of non-phonemic airflow artifacts is incorporated. Experiments show improvements over GAN- and Conditional Flow Matching (CFM)-based baselines on non-intrusive metrics, a consistent preference in human listening tests, and competitive results on intrusive metrics (e.g., PESQ, STOI). The work targets two bottlenecks in high-resolution speech restoration: preserving phone-level fidelity and faithfully recovering high-frequency content.
📝 Abstract
Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion, designed to remove multiple distortions and restore speech recordings to studio quality at a 48 kHz sampling rate. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels on non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it well suited to high-resolution speech restoration.