Fast Text-to-Audio Generation with Adversarial Post-Training

📅 2025-05-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference latency of text-to-audio generation models—hindering real-time creative applications—this paper proposes ARC (Adversarial Relativistic-Contrastive), the first knowledge-distillation-free adversarial post-training acceleration method. ARC uniquely integrates relativistic adversarial training with a novel contrastive discriminator objective, simultaneously enhancing prompt adherence and accelerating generation. Coupled with architectural optimizations of Stable Audio Open and cross-platform deployment adaptations (including H100 GPUs and edge devices), ARC achieves state-of-the-art efficiency: generating 12-second, 44.1-kHz stereo audio in just 75 ms on an H100 GPU, and approximately 7 seconds on mobile edge devices. This represents the fastest reported text-to-audio generation approach to date, offering unprecedented trade-offs among speed, audio fidelity, and deployment flexibility across heterogeneous hardware.

📝 Abstract
Text-to-audio systems, while increasingly performant, are slow at inference time, thus making their latency impractical for many creative applications. We present Adversarial Relativistic-Contrastive (ARC) post-training, the first adversarial acceleration algorithm for diffusion/flow models not based on distillation. While past adversarial post-training methods have struggled to compare against their expensive distillation counterparts, ARC post-training is a simple procedure that (1) extends a recent relativistic adversarial formulation to diffusion/flow post-training and (2) combines it with a novel contrastive discriminator objective to encourage better prompt adherence. We pair ARC post-training with a number of optimizations to Stable Audio Open and build a model capable of generating ≈12s of 44.1kHz stereo audio in ≈75ms on an H100, and ≈7s on a mobile edge-device, the fastest text-to-audio model to our knowledge.
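The two ingredients named in the abstract — a relativistic adversarial formulation and a contrastive discriminator objective — can be sketched in a few lines. The snippet below is an illustrative NumPy toy under common formulations of these losses, not the paper's implementation; the function names and the specific softplus/cross-entropy choices are assumptions. In a relativistic loss the discriminator scores real samples *relative to* generated ones; in the contrastive objective the discriminator must pick out the matched (audio, prompt) pair against prompts shuffled within the batch.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

def relativistic_d_loss(d_real, d_fake):
    """Relativistic discriminator loss (a common formulation, assumed here):
    the discriminator is penalized unless real scores exceed fake scores."""
    return softplus(d_fake - d_real).mean()

def relativistic_g_loss(d_real, d_fake):
    # Generator side: push generated scores above real ones.
    return softplus(d_real - d_fake).mean()

def contrastive_d_loss(scores):
    """Contrastive prompt-adherence objective (illustrative):
    scores[i, j] = D(audio_i, prompt_j), with matched pairs on the
    diagonal. A softmax cross-entropy makes each audio prefer its own
    prompt over the other prompts in the batch."""
    logits = scores - scores.max(axis=1, keepdims=True)  # stability shift
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

For instance, a score matrix with a strong diagonal (each audio scoring highest against its own prompt) drives `contrastive_d_loss` toward zero, while uniform scores on a batch of n pairs give the chance-level value log(n).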
Problem

Research questions and friction points this paper is trying to address.

Reduces the slow inference of text-to-audio generation models
Improves prompt adherence in adversarial post-training methods
Enables real-time audio generation on edge devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial Relativistic-Contrastive post-training for acceleration
Combines relativistic adversarial with contrastive discriminator
Optimizes Stable Audio Open for ultra-fast generation