🤖 AI Summary
Audio generation and editing methods are typically constrained by their reliance on task-specific datasets. Method: This work extends Score Distillation Sampling (SDS) from image diffusion models to text-conditioned audio diffusion models. Leveraging a single pretrained text-to-audio diffusion model, it performs prompt-driven, gradient-based distillation in the model's latent space, optimizing physically grounded parametric representations such as impact sound simulators and FM-synthesis parameters to support diverse tasks including prompt-specified source separation, physics-informed synthesis, and parameter calibration. Contribution/Results: Experiments demonstrate that the framework produces high-fidelity, controllable audio across multiple manipulation tasks without fine-tuning or additional training data. The results validate the generality and robustness of generative-prior distillation in the audio domain and substantially broaden the cross-modal applicability of SDS.
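For context, the distillation step described above follows the standard SDS gradient; the notation below is the conventional one from the image-diffusion literature and is not quoted from the paper, and the exact weighting and conditioning used by Audio-SDS may differ. The audio rendered from parameters $\theta$ is encoded into a latent $\mathbf{z}(\theta)$, noised to $\mathbf{z}_t$, and the frozen diffusion model's noise prediction drives the update:

$$\nabla_\theta \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{z}_t;\, y, t) - \epsilon\big)\,\frac{\partial \mathbf{z}}{\partial \theta} \right],$$

where $y$ is the text prompt, $\epsilon$ the injected noise, $\hat{\epsilon}_\phi$ the pretrained model's noise prediction, and $w(t)$ a timestep weighting.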
📝 Abstract
We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work using generative priors in audio tasks.
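To make the optimization loop concrete, here is a minimal, self-contained PyTorch sketch of SDS applied to a differentiable audio parameterization. The model class, renderer, noise schedule, and tensor shapes are all stand-ins chosen for illustration; they are not the paper's actual architecture or training details.

```python
import math
import torch

# Stand-ins for the real components: the paper uses a pretrained
# text-to-audio latent diffusion model, which is not reproduced here.
class DummyAudioDiffusion(torch.nn.Module):
    """Placeholder epsilon-prediction network (hypothetical interface)."""
    def forward(self, z_t, t, text_emb):
        # A real model would predict the injected noise from z_t, t, and text.
        return torch.zeros_like(z_t)

def render_audio_latent(params):
    """Hypothetical differentiable map from synthesizer parameters
    (e.g. FM-synthesis controls) to the diffusion model's latent space."""
    return params.tanh()  # stand-in for encode(render(params))

model = DummyAudioDiffusion()
text_emb = torch.randn(1, 77, 512)                    # assumed prompt embedding
params = torch.randn(1, 8, 256, requires_grad=True)   # audio parameters theta
opt = torch.optim.Adam([params], lr=1e-2)

for step in range(200):
    z = render_audio_latent(params)                   # z(theta)
    t = torch.randint(20, 980, (1,))                  # random diffusion timestep
    alpha_bar = torch.cos(t / 1000.0 * math.pi / 2) ** 2  # toy noise schedule
    noise = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * noise
    with torch.no_grad():                             # the prior stays frozen
        eps_pred = model(z_t, t, text_emb)
    w = 1.0 - alpha_bar                               # a common SDS weighting
    # SDS trick: treat w * (eps_pred - noise) as the gradient on z (skipping
    # the U-Net Jacobian) and backpropagate only through the renderer.
    sds_loss = (w * (eps_pred - noise).detach() * z).sum()
    opt.zero_grad()
    sds_loss.backward()
    opt.step()
```

In the tasks described above, `render_audio_latent` would be replaced by the impact-sound or FM synthesizer followed by the diffusion model's audio encoder, and `DummyAudioDiffusion` by the pretrained text-to-audio prior.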