🤖 AI Summary
Audio restoration performance degrades significantly under long-duration gaps (>100 ms). To address this, we propose the first discrete diffusion model for audio inpainting: raw waveforms are first encoded into discrete token sequences using a pretrained audio tokenizer (e.g., EnCodec); then, a masked-prediction diffusion process is formulated directly in the discrete latent space, bypassing the instability and semantic drift inherent in continuous-space modeling. This design overcomes the short-gap limitation of conventional waveform- or spectrogram-based diffusion models, enabling high-fidelity and semantically coherent reconstruction of extended gaps (300–500 ms). Experiments on MusicNet and MTG demonstrate that our method matches or surpasses state-of-the-art approaches in both objective metrics and subjective listening evaluations, excelling particularly in musical audio restoration with superior structural consistency and perceptual naturalness.
📝 Abstract
Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform- and spectrogram-based diffusion models, have shown promising results for short gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations up to 300 ms. We further evaluate our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at https://iftach21.github.io/
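The inpainting loop described above (mask the gap in token space, then iteratively commit predictions over several reverse steps) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for the learned network, the confidence-based unmasking schedule is one common choice for discrete masked diffusion, and the vocabulary size is an assumption loosely modeled on an EnCodec codebook.

```python
import numpy as np

VOCAB_SIZE = 1024      # assumed codebook size (EnCodec-like; illustrative)
MASK_ID = VOCAB_SIZE   # special mask token outside the vocabulary

def toy_denoiser(tokens, rng):
    # Hypothetical stand-in for the trained model: returns per-position
    # logits over the vocabulary. A real denoiser would condition on the
    # unmasked context tokens.
    return rng.standard_normal((len(tokens), VOCAB_SIZE))

def inpaint(tokens, gap, steps=4, rng=None):
    """Iteratively unmask the gap, committing the most confident
    predictions first (a confidence-based unmasking schedule)."""
    rng = rng or np.random.default_rng(0)
    x = tokens.copy()
    x[gap] = MASK_ID                      # corrupt: mask out the gap
    for s in range(steps):
        masked = np.flatnonzero(x == MASK_ID)
        if masked.size == 0:
            break
        logits = toy_denoiser(x, rng)
        # Softmax to get per-position confidence of the argmax prediction.
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        pred = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # Unmask a fraction of the remaining gap at each reverse step.
        k = max(1, int(np.ceil(masked.size / (steps - s))))
        commit = masked[np.argsort(-conf[masked])[:k]]
        x[commit] = pred[commit]
    return x

seq = np.arange(20) % 7            # toy "token" sequence
gap = slice(8, 14)                 # simulated 6-token dropout
out = inpaint(seq, gap)
```

After the loop, every masked position holds a vocabulary token while the surrounding context is untouched; a real system would then decode the completed token sequence back to a waveform with the tokenizer's decoder.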