Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

📅 2026-04-30
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the high latency of autoregressive (AR) diffusion models for text-to-audio generation, which stems from iterative decoding and multi-step sampling. The authors propose a one-step sampling framework that combines an energy-distance training objective with representation-level knowledge distillation. An energy-scoring head maps Gaussian noise directly to the audio latent space in a single step, while contextual representations are distilled from a masked autoregressive (MAR) text-to-audio model to preserve strong conditional modeling. On the AudioCaps dataset, the method outperforms existing one-step generation approaches on both objective and subjective metrics. Compared with IMPACT, the state-of-the-art AR diffusion system, it achieves up to 8.5× faster batch inference while maintaining highly competitive audio quality.
📝 Abstract
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to $8.5\times$ faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
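To make the energy-distance objective concrete, here is a minimal sketch of the standard sample-based energy score, which rewards samples that land close to the target while staying spread apart from one another. The abstract does not give the paper's exact loss, so the estimator form, the function names (`energy_score_loss`, `toy_one_step_head`), and the toy linear head are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def energy_score_loss(samples, target):
    """Empirical energy score for m >= 2 model samples and one target latent.

    loss = mean_i ||x_i - y|| - 0.5 * mean_{i != j} ||x_i - x_j||
    The first term pulls samples toward the target; the second term
    rewards diversity among the samples themselves.
    """
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples to estimate the spread term")
    fidelity = sum(l2_distance(x, target) for x in samples) / m
    spread = sum(
        l2_distance(samples[i], samples[j])
        for i in range(m)
        for j in range(m)
        if i != j
    ) / (m * (m - 1))
    return fidelity - 0.5 * spread


def toy_one_step_head(noise, context, scale=0.1):
    """Hypothetical stand-in for a one-step head: a single forward map
    from Gaussian noise plus a context vector to a latent (assumption)."""
    return [c + scale * n for n, c in zip(noise, context)]


if __name__ == "__main__":
    random.seed(0)
    dim = 4
    context = [0.5] * dim   # placeholder for a contextual representation
    target = [0.5] * dim    # placeholder for a ground-truth audio latent
    # One draw of noise per sample; each sample needs only one head evaluation.
    samples = [
        toy_one_step_head([random.gauss(0.0, 1.0) for _ in range(dim)], context)
        for _ in range(8)
    ]
    print("energy score:", energy_score_loss(samples, target))
```

Because each sample costs a single evaluation of the head, drawing several samples per target for this loss avoids the iterative denoising loop that a multi-step diffusion head would require.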
Problem

Research questions and friction points this paper is trying to address.

text-to-audio
autoregressive models
diffusion models
high-latency
one-step sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

one-step sampling
energy-scoring
representation distillation
text-to-audio synthesis
diffusion acceleration