🤖 AI Summary
Existing language-queried audio source separation (LASS) models are limited to a single function (extraction only) and to unintuitive, text-only prompting. This paper proposes PromptSep, the first LASS framework to incorporate vocal imitation as a multimodal prompt, supporting both audio extraction and sound removal. Methodologically, the authors design a multimodal conditioning mechanism that jointly encodes textual queries and vocal imitations, build a conditional diffusion model for separation, and adopt Sketch2Sound, a data augmentation strategy leveraging sketchy audio cues, to improve generalization. Across multiple benchmarks, PromptSep achieves state-of-the-art performance in vocal-imitation-guided separation and sound removal while remaining competitive on standard text-queried LASS. The core contribution is introducing vocal imitation as a conditional input for audio separation, enabling more natural, flexible, and interactive multimodal separation.
📄 Abstract
Recent breakthroughs in language-queried audio source separation (LASS) have shown that generative models can achieve higher audio quality in separation than traditional masking-based approaches. However, two key limitations restrict their practical use: (1) users often require operations beyond separation, such as sound removal; and (2) relying solely on text prompts can be unintuitive for specifying sound sources. In this paper, we propose PromptSep to extend LASS into a broader framework for general-purpose sound separation. PromptSep leverages a conditional diffusion model enhanced with carefully designed data simulation to enable both audio extraction and sound removal. To move beyond text-only queries, we incorporate vocal imitation as an additional, more intuitive conditioning modality for our model, adopting Sketch2Sound as a data augmentation strategy. Both objective and subjective evaluations on multiple benchmarks demonstrate that PromptSep achieves state-of-the-art performance in sound removal and vocal-imitation-guided source separation, while maintaining competitive results on language-queried source separation.
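To make the conditioning idea above concrete, here is a minimal toy sketch of how a text query and a vocal imitation could be jointly encoded and fed into a reverse-diffusion loop. Everything below is an invented stand-in: the paper does not publish its encoders or denoiser, so the hash-based text encoder, spectral imitation encoder, and the placeholder `denoise_step` are illustrative only, showing the data flow rather than PromptSep's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Toy text encoder: hash tokens into a fixed-size embedding (hypothetical)."""
    vec = np.zeros(dim)
    for tok in prompt.split():
        vec[hash(tok) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def encode_imitation(waveform: np.ndarray, dim: int = 8) -> np.ndarray:
    """Toy vocal-imitation encoder: coarse spectral-energy summary (hypothetical)."""
    spec = np.abs(np.fft.rfft(waveform))
    bands = np.array_split(spec, dim)
    vec = np.array([b.mean() for b in bands])
    return vec / max(np.linalg.norm(vec), 1e-8)

def denoise_step(x_t: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """One placeholder reverse-diffusion step conditioned on the fused prompt.
    A real model would predict noise with a neural network; here we only
    shrink x_t toward a condition-dependent bias to show where cond enters."""
    bias = cond.mean() * np.ones_like(x_t)
    return x_t - t * (x_t - bias) * 0.1

mixture = rng.standard_normal(64)      # stand-in for a noisy mixture latent
imitation = rng.standard_normal(64)    # stand-in for the user's hummed imitation

# Fuse both modalities into a single conditioning vector.
cond = np.concatenate([encode_text("dog barking"), encode_imitation(imitation)])

x = mixture.copy()
for t in np.linspace(1.0, 0.0, 10):   # toy reverse-diffusion loop
    x = denoise_step(x, cond, t)

print(x.shape)  # separated-source estimate, same shape as the mixture: (64,)
```

The same interface also suggests how sound removal fits in: swapping the target of the conditioning (keep vs. remove the queried source) changes the task without changing the data flow.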