AI Summary
Diffusion-based text-to-music (TTM) generation suffers from slow inference and a trade-off between quality and speed. To address this, the paper proposes Presto!, a dual distillation approach for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. The key contributions are: (1) a score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, described as the first GAN-based distillation method for TTM; (2) an improved layer distillation method that better preserves hidden state variance; and (3) a combined step-and-layer distillation scheme. For 32-second, 44.1 kHz mono/stereo generation, the combined method reaches latencies of 230 ms and 435 ms respectively, accelerating the base model by 10–18× and running 15× faster than comparable SOTA, while producing high-quality outputs with improved diversity. The authors describe it as the fastest high-quality TTM system to their knowledge.
Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by better preserving hidden state variance. Finally, we combine our step and layer distillation methods for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435 ms latency for 32-second mono/stereo 44.1 kHz audio, 15x faster than comparable SOTA), making it the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
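To put the reported latencies in perspective, they can be converted to a real-time factor, i.e. seconds of audio produced per second of compute. The snippet below is a minimal arithmetic sketch, not code from the paper: `real_time_factor` is a hypothetical helper name, and it assumes the 230 ms figure refers to mono and the 435 ms figure to stereo, following the order in which they are listed.

```python
# Real-time factor: seconds of audio generated per second of compute.
# Inputs are the figures quoted in the abstract; the mono/stereo mapping
# of 230 ms / 435 ms is an assumption based on the order they are listed.
CLIP_SECONDS = 32.0
LATENCY_MS = {"mono": 230.0, "stereo": 435.0}


def real_time_factor(clip_seconds: float, latency_ms: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    return clip_seconds / (latency_ms / 1000.0)


rtf = {name: real_time_factor(CLIP_SECONDS, ms) for name, ms in LATENCY_MS.items()}

for name, value in rtf.items():
    print(f"{name}: {value:.1f}x real time")
```

Under these assumptions, a 32-second clip is generated roughly 139 times faster than real time in mono and roughly 74 times faster in stereo.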