High-Resolution Speech Restoration with Latent Diffusion Model

📅 2024-09-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional speech enhancement methods typically target a single distortion, while generative models that handle several distortions at once struggle to reconstruct phones and high-frequency harmonics, produce breathing and gasping artifacts, are computationally demanding, and are often restricted to wide-band output. This paper introduces Hi-ResLDM, a latent diffusion model (LDM) for restoring speech to studio quality at 48 kHz. The approach constructs a joint time-frequency latent space that enables end-to-end full-band reconstruction, including ultra-high-frequency components. Conditional denoising training supports unified modeling of diverse distortions, and non-phonemic airflow artifacts are explicitly suppressed. Experiments show improvements over GAN-based and Conditional Flow Matching (CFM) based baselines on non-intrusive metrics, a consistent preference in subjective listening tests, and competitive results on intrusive metrics (e.g., PESQ, STOI). The work addresses two bottlenecks in high-resolution speech restoration: phoneme-level fidelity preservation and faithful high-frequency recovery.

📝 Abstract

Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
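To make the wide-band versus 48 kHz distinction concrete, the short sketch below computes the highest frequency each sampling rate can represent (the Nyquist limit, half the sampling rate). The band names and the sampling rates paired with them follow common telephony and audio conventions and are assumptions here; the abstract itself only states the 48 kHz target.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency representable at a given sampling rate."""
    return sample_rate_hz / 2.0


# Conventional band labels and their usual sampling rates.
# These pairings are assumptions for illustration, not taken from the paper.
BANDS = {"narrow-band": 8_000, "wide-band": 16_000, "full-band": 48_000}

for name, sample_rate in BANDS.items():
    limit = nyquist_hz(sample_rate)
    print(f"{name:>11}: {sample_rate} Hz sampling, content up to {limit:.0f} Hz")
```

Under these assumed conventions, a wide-band system cannot carry anything above 8 kHz, whereas a 48 kHz output can in principle hold content up to 24 kHz, which is where the high-frequency detail discussed above resides.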

Problem

Research questions and friction points this paper is trying to address.

Restores studio-quality speech at 48 kHz

Removes multiple distortions effectively

Enhances high-frequency harmonics accurately

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent diffusion model

Studio quality restoration

High-frequency detail regeneration
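For readers unfamiliar with the term, latent diffusion models are commonly trained by adding Gaussian noise to a compact latent representation and learning to predict that noise. The sketch below shows this generic noise-prediction objective in plain Python. It is not Hi-ResLDM's training code: the linear noise schedule, the function names, the idea of conditioning on a latent of the degraded recording, and the placeholder predictor are all assumptions made for illustration.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear noise schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise_latent(z0, t, alpha_bar, rng):
    """Forward process: mix the clean latent z0 with Gaussian noise at step t."""
    signal = math.sqrt(alpha_bar[t])
    noise = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [signal * z + noise * e for z, e in zip(z0, eps)]
    return zt, eps


def denoising_loss(predict_eps, z0, cond, t, alpha_bar, rng):
    """Mean squared error between the predicted and the true noise."""
    zt, eps = noise_latent(z0, t, alpha_bar, rng)
    pred = predict_eps(zt, cond, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)


rng = random.Random(0)
alpha_bar = make_alpha_bar()

# Hypothetical stand-ins: a clean-speech latent and a latent of the degraded input.
clean_latent = [rng.gauss(0.0, 1.0) for _ in range(512)]
degraded_latent = [z + 0.5 * rng.gauss(0.0, 1.0) for z in clean_latent]


def zero_predictor(zt, cond, t):
    """Placeholder for a neural network; always predicts zero noise."""
    return [0.0] * len(zt)


loss = denoising_loss(zero_predictor, clean_latent, degraded_latent, 500, alpha_bar, rng)
print(f"loss with an untrained placeholder predictor: {loss:.3f}")
```

In a real system the placeholder would be a trained network and the latents would come from a learned encoder; at inference time the process runs in reverse, starting from noise and denoising step by step while conditioned on the degraded input.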
