Presto! Distilling Steps and Layers for Accelerating Music Generation

📅 2024-10-07
🏛️ arXiv.org
📈 Citations: 3
✨ Influential: 1
🤖 AI Summary
To address the low inference efficiency and the quality-speed trade-off in text-to-music (TTM) generation with diffusion models, this paper proposes the first dual-path distillation framework tailored to the EDM architecture, jointly compressing both the number of sampling steps and the computational cost per step. Key contributions: (1) GAN-style distribution matching distillation (DMD), enabling efficient knowledge transfer from teacher to student EDM models; (2) variance-aware latent-state layer distillation, preserving generative diversity; and (3) an end-to-end, step-and-layer co-optimized dual-distillation mechanism. Evaluated on 44.1 kHz mono/stereo music generation (32-second clips), the method achieves inference latencies of 230 ms (mono) and 435 ms (stereo), accelerating inference by 10-18x over the teacher model and running 15x faster than comparable prior SOTA, while maintaining high audio fidelity and musical diversity, making it the fastest high-quality TTM system to date.
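As a rough illustration (not the paper's implementation), the score-difference update at the heart of distribution matching distillation can be sketched in a few lines: a student sample is noised, and the gap between the teacher's score and a "fake" score model tracking the student's own distribution gives the update direction. The Gaussian score functions below are placeholders standing in for the real networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_score(x, sigma):
    # Placeholder: score of a unit-Gaussian data distribution
    # convolved with noise of scale sigma.
    return -x / (1.0 + sigma**2)

def fake_score(x, sigma):
    # Placeholder "fake" score, tracking a (wider) student distribution.
    return -x / (4.0 + sigma**2)

def dmd_gradient(x_student, sigma):
    # Noise the student's sample, then use the teacher/fake score
    # difference as the direction that pulls the student's output
    # distribution toward the teacher's (the core DMD signal).
    noised = x_student + sigma * rng.standard_normal(x_student.shape)
    return teacher_score(noised, sigma) - fake_score(noised, sigma)
```

In practice this gradient is applied to the one-step student generator and, per the paper, combined with a GAN-style discriminator term; here it only illustrates the update's shape.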

๐Ÿ“ Abstract
Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers via reducing both sampling steps and cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM-family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning via better preserving hidden state variance. Finally, we combine our step and layer distillation methods together for a dual-faceted approach. We evaluate our step and layer distillation methods independently and show each yields best-in-class performance. Our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (230/435ms latency for 32 second mono/stereo 44.1kHz, 15x faster than comparable SOTA) -- the fastest high-quality TTM to our knowledge. Sound examples can be found at https://presto-music.github.io/web/.
Problem

Research questions and friction points this paper is trying to address.

Accelerating high-quality music generation from text
Reducing sampling steps via GAN-based distillation
Lowering per-step cost through improved layer distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

GAN-based distillation for TTM acceleration
Improved layer distillation via variance preservation
Combined step and layer distillation for speed
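The variance-preservation idea behind the improved layer distillation can be sketched minimally: when a reduced-depth student's hidden states shrink in variance relative to the teacher's, rescaling them per feature restores the activation statistics that downstream layers expect. This is a hypothetical numpy sketch, not the paper's actual mechanism.

```python
import numpy as np

def match_variance(h_student, h_teacher, eps=1e-6):
    # Rescale the student's hidden states so their per-feature
    # standard deviation matches the teacher's, counteracting the
    # variance shrinkage that naive layer dropping can cause.
    s_std = h_student.std(axis=0, keepdims=True)
    t_std = h_teacher.std(axis=0, keepdims=True)
    return h_student * (t_std / (s_std + eps))
```

A distillation loss could then compare the rescaled student states to the teacher's, so the student is trained on direction rather than penalized for scale.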