🤖 AI Summary
Current audio separation models suffer from domain specificity (e.g., speech- or music-only) or reliance on single-modal prompts (e.g., text only), lacking unified support for multimodal prompts—including text, visual masks, and temporal spans. To address this, we propose the first general-purpose, multimodal-controllable audio separation foundation model, built upon a diffusion Transformer architecture. Our approach introduces flow matching-based training, cross-domain large-scale pretraining, and a novel multimodal prompt fusion mechanism. We further construct the first real-world multimodal-annotated benchmark for audio separation and design a reference-free, human-perception-aware evaluation model. Experiments demonstrate that our model achieves state-of-the-art performance across diverse benchmarks—including general sound, speech, music, and instrument separation—significantly outperforming both domain-specific and existing general-purpose methods. This enables truly flexible, controllable separation in open acoustic scenarios.
📝 Abstract
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal-span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
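For readers unfamiliar with flow matching, the sketch below illustrates the general form of the training objective the abstract refers to: a network regresses the velocity of a noise-to-target interpolation path, conditioned on the mixture and a prompt embedding. This is a minimal, generic illustration of conditional flow matching only; the function name, the model signature `model(x_t, mixture, prompt, t)`, and the latent shapes are assumptions for illustration, not SAM Audio's actual implementation.

```python
# Minimal sketch (not the authors' code) of a conditional flow matching
# training step for a prompt-conditioned source-separation model.
import torch


def flow_matching_loss(model, target, mixture, prompt):
    """One flow-matching (rectified-flow style) training step.

    target  : (B, T, D) latent of the clean source to be separated out
    mixture : (B, T, D) latent of the input mixture (conditioning)
    prompt  : multimodal prompt embedding (text / visual mask / time span)
    """
    b = target.shape[0]
    t = torch.rand(b, device=target.device)            # random time in [0, 1]
    t_ = t.view(b, *([1] * (target.dim() - 1)))        # broadcast over feature dims

    noise = torch.randn_like(target)                   # x_0 ~ N(0, I)
    x_t = (1.0 - t_) * noise + t_ * target             # linear interpolation path
    v_target = target - noise                          # path velocity d(x_t)/dt

    v_pred = model(x_t, mixture, prompt, t)            # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)        # regress onto the true velocity


if __name__ == "__main__":
    # Toy usage with a dummy velocity predictor (purely illustrative).
    dummy = lambda x_t, mixture, prompt, t: x_t - mixture + prompt
    tgt = torch.randn(2, 50, 8)
    mix = torch.randn(2, 50, 8)
    pmt = torch.randn(2, 50, 8)
    print(flow_matching_loss(dummy, tgt, mix, pmt).item())
```

At inference time, such a model would integrate the learned velocity field from noise toward the separated source while keeping the mixture and prompt fixed as conditioning; the details of that sampler are likewise not specified in the abstract.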