🤖 AI Summary
Current audio separation models suffer from domain specificity (e.g., speech- or music-only) or reliance on single-modal prompts (e.g., text only), lacking unified support for multimodal prompts—including text, visual masks, and temporal spans. To address this, we propose the first general-purpose, multimodal-controllable audio separation foundation model, built upon a diffusion Transformer architecture. Our approach introduces flow matching-based training, cross-domain large-scale pretraining, and a novel multimodal prompt fusion mechanism. We further construct the first real-world multimodal-annotated benchmark for audio separation and design a reference-free, human-perception-aware evaluation model. Experiments demonstrate that our model achieves state-of-the-art performance across diverse benchmarks—including general sound, speech, music, and instrument separation—significantly outperforming both domain-specific and existing general-purpose methods. This enables truly flexible, controllable separation in open acoustic scenarios.
📝 Abstract
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal-span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
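For readers unfamiliar with flow matching, the sketch below illustrates the general form of the training objective the abstract refers to: a network regresses the velocity of a noise-to-target interpolation path, conditioned on the mixture and a prompt embedding. This is a minimal, generic illustration of conditional flow matching only; the function name, the model signature `model(x_t, mixture, prompt, t)`, and the latent shapes are assumptions for illustration, not SAM Audio's actual implementation.

```python
# Minimal sketch (not the authors' code) of a conditional flow matching
# training step for a prompt-conditioned source-separation model.
import torch


def flow_matching_loss(model, target, mixture, prompt):
    """One flow-matching (rectified-flow style) training step.

    target  : (B, T, D) latent of the clean source to be separated out
    mixture : (B, T, D) latent of the input mixture (conditioning)
    prompt  : multimodal prompt embedding (text / visual mask / time span)
    """
    b = target.shape[0]
    t = torch.rand(b, device=target.device)            # random time in [0, 1]
    t_ = t.view(b, *([1] * (target.dim() - 1)))        # broadcast over feature dims

    noise = torch.randn_like(target)                   # x_0 ~ N(0, I)
    x_t = (1.0 - t_) * noise + t_ * target             # linear interpolation path
    v_target = target - noise                          # path velocity d(x_t)/dt

    v_pred = model(x_t, mixture, prompt, t)            # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)        # regress onto the true velocity


if __name__ == "__main__":
    # Toy usage with a dummy velocity predictor (purely illustrative).
    dummy = lambda x_t, mixture, prompt, t: x_t - mixture + prompt
    tgt = torch.randn(2, 50, 8)
    mix = torch.randn(2, 50, 8)
    pmt = torch.randn(2, 50, 8)
    print(flow_matching_loss(dummy, tgt, mix, pmt).item())
```

At inference time, such a model would integrate the learned velocity field from noise toward the separated source while keeping the mixture and prompt fixed as conditioning; the details of that sampler are likewise not specified in the abstract.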