🤖 AI Summary
Audio generation and editing methods are typically constrained by their reliance on task-specific datasets. Method: This work extends Score Distillation Sampling (SDS) from image diffusion models to text-conditioned audio diffusion models. Leveraging a single pretrained text-to-audio diffusion model, it performs prompt-driven, gradient-based distillation in the model's latent space, optimizing physically grounded parametric representations such as impact sound simulators and FM-synthesis parameters to support diverse tasks including prompt-specified source separation, physics-informed synthesis, and parameter calibration. Contribution/Results: Experiments demonstrate that the framework produces high-fidelity, controllable audio across multiple manipulation tasks without fine-tuning or additional training data. The results validate the generality and robustness of generative-prior distillation in the audio domain and substantially broaden the cross-modal applicability of SDS.
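For context, the distillation step described above follows the standard SDS gradient; the notation below is the conventional one from the image-diffusion literature and is not quoted from the paper, and the exact weighting and conditioning used by Audio-SDS may differ. The audio rendered from parameters $\theta$ is encoded into a latent $\mathbf{z}(\theta)$, noised to $\mathbf{z}_t$, and the frozen diffusion model's noise prediction drives the update:

$$\nabla_\theta \mathcal{L}_{\text{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{z}_t;\, y, t) - \epsilon\big)\,\frac{\partial \mathbf{z}}{\partial \theta} \right],$$

where $y$ is the text prompt, $\epsilon$ the injected noise, $\hat{\epsilon}_\phi$ the pretrained model's noise prediction, and $w(t)$ a timestep weighting.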
📝 Abstract
We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work using generative priors in audio tasks.
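To make the optimization loop concrete, here is a minimal, self-contained PyTorch sketch of SDS applied to a differentiable audio parameterization. The model class, renderer, noise schedule, and tensor shapes are all stand-ins chosen for illustration; they are not the paper's actual architecture or training details.

```python
import math
import torch

# Stand-ins for the real components: the paper uses a pretrained
# text-to-audio latent diffusion model, which is not reproduced here.
class DummyAudioDiffusion(torch.nn.Module):
    """Placeholder epsilon-prediction network (hypothetical interface)."""
    def forward(self, z_t, t, text_emb):
        # A real model would predict the injected noise from z_t, t, and text.
        return torch.zeros_like(z_t)

def render_audio_latent(params):
    """Hypothetical differentiable map from synthesizer parameters
    (e.g. FM-synthesis controls) to the diffusion model's latent space."""
    return params.tanh()  # stand-in for encode(render(params))

model = DummyAudioDiffusion()
text_emb = torch.randn(1, 77, 512)                    # assumed prompt embedding
params = torch.randn(1, 8, 256, requires_grad=True)   # audio parameters theta
opt = torch.optim.Adam([params], lr=1e-2)

for step in range(200):
    z = render_audio_latent(params)                   # z(theta)
    t = torch.randint(20, 980, (1,))                  # random diffusion timestep
    alpha_bar = torch.cos(t / 1000.0 * math.pi / 2) ** 2  # toy noise schedule
    noise = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * noise
    with torch.no_grad():                             # the prior stays frozen
        eps_pred = model(z_t, t, text_emb)
    w = 1.0 - alpha_bar                               # a common SDS weighting
    # SDS trick: treat w * (eps_pred - noise) as the gradient on z (skipping
    # the U-Net Jacobian) and backpropagate only through the renderer.
    sds_loss = (w * (eps_pred - noise).detach() * z).sum()
    opt.zero_grad()
    sds_loss.backward()
    opt.step()
```

In the tasks described above, `render_audio_latent` would be replaced by the impact-sound or FM synthesizer followed by the diffusion model's audio encoder, and `DummyAudioDiffusion` by the pretrained text-to-audio prior.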