Transient Noise Removal via Diffusion-based Speech Inpainting

📅 2025-08-12
🤖 AI Summary
This study addresses speech that is severely degraded, or lost entirely for up to one second, by intense transient noise such as fireworks or door slams. The authors propose PGDI, a diffusion-based speech inpainting framework. Its central component is phoneme-level classifier guidance, which steers reconstruction toward the intended speech content while preserving speaker identity, prosody, and acoustic environment characteristics such as reverberation. PGDI is speaker-independent and works with or without a transcript at inference time; access to the text improves results further. Experiments across diverse speakers and gap lengths show that PGDI outperforms the methods it is compared against and remains effective even when long segments are fully masked.

📝 Abstract
In this paper, we present PGDI, a diffusion-based speech inpainting framework for restoring missing or severely corrupted speech segments. Unlike previous methods that struggle with speaker variability or long gap lengths, PGDI can accurately reconstruct gaps of up to one second in length while preserving speaker identity, prosody, and environmental factors such as reverberation. Central to this approach is classifier guidance, specifically phoneme-level guidance, which substantially improves reconstruction fidelity. PGDI operates in a speaker-independent manner and maintains robustness even when long segments are completely masked by strong transient noise, making it well-suited for real-world conditions involving transient noise such as fireworks, door slams, hammer strikes, and construction noise. Through extensive experiments across diverse speakers and gap lengths, we demonstrate PGDI's superior inpainting performance and its ability to handle challenging acoustic conditions. We consider both scenarios, with and without access to the transcript during inference, showing that while the availability of text further enhances performance, the model remains effective even in its absence. For audio samples, visit: https://mordehaym.github.io/PGDI/
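
For readers unfamiliar with classifier guidance, the general idea can be written as a short identity. This is the textbook formulation, not a statement of PGDI's exact objective: the symbols x_t (the noisy signal at diffusion step t), y (here assumed to be a phoneme label sequence), and the guidance scale s are notation introduced for illustration and do not come from the abstract.

```latex
% Bayes' rule applied to the score of the noisy signal x_t given labels y:
\nabla_{x_t} \log p_t(x_t \mid y)
  = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)

% In practice the classifier term is usually weighted by a scale s > 0:
\tilde{g}(x_t, y)
  = \nabla_{x_t} \log p_t(x_t) + s \, \nabla_{x_t} \log p_t(y \mid x_t)
```

The first term comes from the diffusion model itself; the second is the gradient of a classifier (here, presumably a phoneme classifier) evaluated on the noisy signal. Setting s = 1 recovers the exact conditional score, while larger values push samples more strongly toward the target labels.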

Problem

Research questions and friction points this paper is trying to address.

Removing transient noise from corrupted speech segments
Reconstructing long speech gaps while preserving speaker identity
Handling speaker-independent speech inpainting with phoneme-level guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework for speech inpainting
Phoneme-level classifier guidance improves fidelity
Speaker-independent robust long-gap reconstruction
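
The points above can be illustrated with a toy NumPy sketch of one classifier-guided update followed by an inpainting merge. Everything here is an assumption made for illustration and is not PGDI's implementation: the "diffusion model" is replaced by the score of a standard Gaussian, the phoneme classifier is a random linear softmax layer, and the names `phoneme_log_prob_grad`, `guided_score`, and `merge_known` are hypothetical. A real sampler would also re-noise the observed frames to the current noise level before merging.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def phoneme_log_prob_grad(frames, W, targets):
    """Gradient of sum_t log p(targets[t] | frames[t]) w.r.t. frames.

    frames: (T, D) features, W: (K, D) weights of a toy linear classifier,
    targets: (T,) integer phoneme labels. Returns an array of shape (T, D).
    """
    probs = softmax(frames @ W.T)          # (T, K) class probabilities
    onehot = np.eye(W.shape[0])[targets]   # (T, K) one-hot targets
    return (onehot - probs) @ W            # (T, D)

def guided_score(uncond_score, classifier_grad, scale):
    # Classifier guidance: unconditional score plus scaled classifier gradient.
    return uncond_score + scale * classifier_grad

def merge_known(x_generated, x_observed, gap_mask):
    # Keep generated frames inside the gap (mask == 1), observed frames elsewhere.
    m = gap_mask[:, None]
    return m * x_generated + (1.0 - m) * x_observed

rng = np.random.default_rng(0)
T, D, K = 6, 4, 3                              # frames, feature dim, phoneme classes
observed = rng.normal(size=(T, D))             # stand-in for the uncorrupted context
gap_mask = np.array([0, 0, 1, 1, 0, 0], dtype=float)  # frames 2-3 are missing
W = rng.normal(size=(K, D))                    # toy classifier weights (assumed)
targets = np.array([0, 1, 2, 2, 1, 0])         # assumed phoneme labels per frame

x = rng.normal(size=(T, D))                    # current noisy estimate
uncond = -x                                    # score of N(0, I), stand-in for a learned model
grad = phoneme_log_prob_grad(x, W, targets)
x_next = x + 0.1 * guided_score(uncond, grad, scale=2.0)  # one small guided step
merged = merge_known(x_next, observed, gap_mask)
```

The merge step is what makes this inpainting rather than free generation: only the masked frames are taken from the model, so the surrounding speech is passed through unchanged.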