PromptSep: Generative Audio Separation via Multimodal Prompting

📅 2025-11-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing language-queried audio source separation (LASS) models are limited in two ways: they support extraction only, and they rely on text prompts, which can be an unintuitive way to specify a sound source. This paper proposes PromptSep, a LASS framework that accepts vocal imitation as an additional prompt modality and supports both audio extraction and sound removal. The method uses a conditional diffusion model trained with elaborated data simulation, conditions it on text queries and vocal imitations, and incorporates Sketch2Sound as a data augmentation strategy for the vocal-imitation conditioning. On multiple benchmarks, PromptSep achieves state-of-the-art performance in vocal-imitation-guided separation and sound removal, while remaining competitive on standard text-based LASS. The core contribution is the use of vocal imitation as a conditioning input for audio separation, which makes prompting more natural and flexible.

📝 Abstract
Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher audio quality in separation than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model enhanced with elaborated data simulation to enable both audio extraction and sound removal. To move beyond text-only queries, we add vocal imitation as an additional and more intuitive conditioning modality, using Sketch2Sound as a data augmentation strategy. Both objective and subjective evaluations on multiple benchmarks demonstrate that PromptSep achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.
Problem

Research questions and friction points this paper is trying to address.

Extends audio separation beyond text queries using multimodal prompts
Enables both sound extraction and removal through conditional diffusion models
Incorporates vocal imitation as an intuitive conditioning modality for source separation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a conditional diffusion model for audio separation
Enables both audio extraction and sound removal
Incorporates vocal imitation as an intuitive conditioning modality
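
The difference between the extraction and removal tasks can be illustrated with a small data-simulation sketch. This is only an assumption about how such training pairs might be built, not the paper's pipeline, which this summary does not describe. The helper name `simulate_pair`, its parameters, and the toy signals are all hypothetical. The idea shown is that the same mixture yields two different targets: the queried source alone for extraction, and everything except the queried source for removal.

```python
def simulate_pair(sources, query_index, mode, gains=None):
    """Build one (mixture, target) pair from equal-length mono sources.

    Hypothetical helper for illustration only; it is not taken from the paper.
    sources     -- list of sample lists, one per sound source
    query_index -- index of the source named by the prompt
    mode        -- "extract" (keep the queried source) or "remove" (drop it)
    gains       -- optional per-source linear gains, defaulting to 1.0
    """
    if mode not in ("extract", "remove"):
        raise ValueError("mode must be 'extract' or 'remove'")
    if not 0 <= query_index < len(sources):
        raise IndexError("query_index out of range")
    length = len(sources[0])
    if any(len(s) != length for s in sources):
        raise ValueError("all sources must have the same length")

    gains = gains or [1.0] * len(sources)
    scaled = [[g * x for x in s] for g, s in zip(gains, sources)]

    # The mixture is the sample-wise sum of all scaled sources.
    mixture = [sum(frame) for frame in zip(*scaled)]

    if mode == "extract":
        # Extraction target: only the queried source.
        target = list(scaled[query_index])
    else:
        # Removal target: the sum of every source except the queried one.
        rest = [s for i, s in enumerate(scaled) if i != query_index]
        target = [sum(frame) for frame in zip(*rest)] if rest else [0.0] * length
    return mixture, target


# Toy usage: the prompt refers to source 0 (for example, "dog barking").
dog = [1.0, 2.0, 3.0]
rain = [0.5, 0.5, 0.5]
mixture, extract_target = simulate_pair([dog, rain], 0, "extract")
_, remove_target = simulate_pair([dog, rain], 0, "remove")
```

In a setup like this, a single conditional model could be trained on both modes by exposing the mode as part of the conditioning, though whether PromptSep does so in this way is not stated here.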