IMPACT: Iterative Mask-based Parallel Decoding for Text-to-Audio Generation with Diffusion Modeling

📅 2025-05-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-audio generation faces a fundamental trade-off between high fidelity and low inference latency: diffusion models (e.g., Tango, AudioLDM) achieve superior audio quality but suffer from slow sequential sampling, whereas discrete token masking models (e.g., MAGNET) enable fast parallel decoding yet yield limited perceptual quality due to quantization bottlenecks. This work introduces the first iterative masked parallel decoding paradigm operating directly in the *continuous* diffusion latent space—bypassing the representational constraints imposed by discrete tokenization. Our method jointly integrates diffusion-based modeling, continuous latent representation, iterative mask prediction, and parallel decoding. Evaluated on AudioCaps, it achieves state-of-the-art Fréchet Distance (FD) and Fréchet Audio Distance (FAD), while reducing inference latency by a significant margin compared to leading diffusion-based approaches.
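The iterative mask-based parallel decoding loop described above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the `toy_predict` stand-in, the cosine masking schedule, and all names and parameters are assumptions. In IMPACT the predictor would be a text-conditioned diffusion model producing continuous latent frames, whereas here a random stand-in is used.

```python
import numpy as np

def iterative_mask_decode(predict_fn, seq_len, dim, num_iters=8):
    """Sketch of iterative mask-based parallel decoding over continuous latents.

    predict_fn(latents, mask) -> (values, confidences): proposes a value and a
    per-position confidence for every position in parallel. Each iteration
    commits the most confident masked positions and re-masks the rest.
    """
    latents = np.zeros((seq_len, dim))       # continuous latent frames
    mask = np.ones(seq_len, dtype=bool)      # True = still masked
    for it in range(num_iters):
        values, conf = predict_fn(latents, mask)
        # cosine schedule (assumed): fraction of positions left masked next round
        keep_masked = int(seq_len * np.cos(np.pi / 2 * (it + 1) / num_iters))
        conf = np.where(mask, conf, -np.inf)  # only masked positions compete
        order = np.argsort(-conf)             # most confident first
        commit = order[: max(int(mask.sum()) - keep_masked, 0)]
        latents[commit] = values[commit]      # commit in parallel
        mask[commit] = False                  # these positions are now fixed
        if not mask.any():
            break
    return latents

# Random stand-in for the text-conditioned latent predictor.
rng = np.random.default_rng(0)
def toy_predict(latents, mask):
    return rng.standard_normal(latents.shape), rng.random(latents.shape[0])

out = iterative_mask_decode(toy_predict, seq_len=16, dim=4)
```

The cosine schedule makes early iterations commit few positions and later ones commit many, mirroring the coarse-to-fine refinement that makes masked parallel decoding fast relative to step-by-step diffusion sampling.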

📝 Abstract
Text-to-audio generation synthesizes realistic sounds or music given a natural language prompt. Diffusion-based frameworks, including the Tango and AudioLDM series, represent the state-of-the-art in text-to-audio generation. Despite achieving high audio fidelity, they incur significant inference latency due to the slow diffusion sampling process. MAGNET, a mask-based model operating on discrete tokens, addresses slow inference through iterative mask-based parallel decoding. However, its audio quality still lags behind that of diffusion-based models. In this work, we introduce IMPACT, a text-to-audio generation framework that achieves high performance in audio quality and fidelity while ensuring fast inference. IMPACT utilizes iterative mask-based parallel decoding in a continuous latent space powered by diffusion modeling. This approach eliminates the fidelity constraints of discrete tokens while maintaining competitive inference speed. Results on AudioCaps demonstrate that IMPACT achieves state-of-the-art performance on key metrics including Fréchet Distance (FD) and Fréchet Audio Distance (FAD) while significantly reducing latency compared to prior models. The project website is available at https://audio-impact.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-audio generation speed and quality
Reducing inference latency in diffusion-based audio synthesis
Combining mask-based parallel decoding with diffusion modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative mask-based parallel decoding
Continuous latent space diffusion modeling
High audio quality with fast inference