🤖 AI Summary
This work addresses inefficiency, limited controllability, and redundant generation in variable-length audio synthesis and editing by introducing a family of fast latent diffusion models (small, medium, and large) built on a novel semantic-acoustic autoencoder. The autoencoder constructs a compact latent space that jointly preserves semantic structure and acoustic fidelity. Adversarial post-training of the diffusion models then reduces the number of inference steps while improving prompt adherence and audio quality. The models support efficient generation, inpainting, and continuation, producing high-fidelity audio in under two seconds on an H200 GPU or within a few seconds on a MacBook Pro M4. The small and medium models, along with the complete training and inference pipeline, are publicly released to facilitate deployment on consumer-grade hardware.
📝 Abstract
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output for a short sound. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we run adversarial post-training to both accelerate inference and improve generation quality: it reduces the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
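To make the cost argument for variable-length generation and the idea of inpainting concrete, the following is a minimal sketch, not the released pipeline. It assumes a latent sequence whose length grows linearly with audio duration; the sample rate, the downsampling ratio, and the function names `latent_frames` and `inpainting_mask` are hypothetical values and helpers chosen for illustration, not figures taken from Stable Audio 3.

```python
import math

# Hypothetical values for illustration only; not taken from Stable Audio 3.
SAMPLE_RATE_HZ = 44_100
DOWNSAMPLING_RATIO = 2_048  # audio samples per latent frame (assumed)


def latent_frames(duration_s: float) -> int:
    """Number of latent frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * SAMPLE_RATE_HZ / DOWNSAMPLING_RATIO)


def inpainting_mask(total_s: float, edit_start_s: float, edit_end_s: float) -> list[int]:
    """Per-frame mask over the latent sequence: 1 = regenerate, 0 = keep.

    Continuation is the special case where the edit region runs from the end
    of the existing recording to `total_s`.
    """
    n = latent_frames(total_s)
    start = math.floor(edit_start_s * SAMPLE_RATE_HZ / DOWNSAMPLING_RATIO)
    end = min(n, latent_frames(edit_end_s))
    return [1 if start <= i < end else 0 for i in range(n)]


# A short sound needs far fewer latent frames than a multi-minute track,
# so a variable-length model avoids denoising frames that would be discarded.
short_frames = latent_frames(3.0)
long_frames = latent_frames(180.0)

# Regenerate only the second between 1 s and 2 s of a 3 s clip.
mask = inpainting_mask(3.0, 1.0, 2.0)
```

Under these assumed settings, a 3 s sound occupies about 60 times fewer latent frames than a 3 min track, which is the saving the abstract attributes to variable-length generation. The mask marks which latent frames a sampler would be free to change; how the models actually condition on the kept frames is not specified in the abstract.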