High-Resolution Speech Restoration with Latent Diffusion Model

📅 2024-09-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional speech enhancement methods struggle to jointly remove multiple distortions, reconstruct high-frequency harmonics, and suppress breathing and gasping artifacts, while being limited to wide-band output and incurring high computational overhead. This paper introduces Hi-ResLDM, a latent diffusion model (LDM) tailored for 48 kHz studio-quality speech restoration. The approach constructs a joint time-frequency latent space, enabling end-to-end full-band reconstruction, including ultra-high-frequency components. Conditional denoising training supports unified modeling of diverse distortions, and non-phonemic airflow artifacts are explicitly suppressed. Experiments show improvements over GAN- and conditional flow matching (CFM)-based baselines on non-intrusive metrics, a consistent preference in subjective listening tests, and competitive results on intrusive metrics (e.g., PESQ, STOI). The work targets two bottlenecks in high-resolution speech restoration: phone-level fidelity and faithful high-frequency recovery.

📝 Abstract
Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
Problem

Research questions and friction points this paper is trying to address.

Restoring speech to studio quality at 48 kHz
Removing multiple distortions with a single model
Accurately regenerating high-frequency harmonics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion model
Studio quality restoration
High-frequency detail regeneration
Authors
Tushar Dhyani
Sony Europe B.V., Stuttgart, Germany
Florian Lux
Speech Technology Scientist, AppTek
Speech Synthesis, Natural Language Processing, Machine Learning, Artificial Intelligence
Michele Mancusi
Sony Europe B.V., Stuttgart, Germany
Giorgio Fabbro
Sony Europe B.V., Stuttgart, Germany
Fritz Hohl
Sony Europe B.V., Stuttgart, Germany
Ngoc Thang Vu
University of Stuttgart, Germany