SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation

📅 2026-01-27
🤖 AI Summary
This work addresses the limitations of existing audio separation evaluation metrics, which are often misaligned with human perception, coarse-grained, and dependent on ground-truth reference signals, while subjective listening tests remain costly and hard to scale. To overcome these challenges, we propose SAM Audio Judge (SAJ), a reference-free, multimodal, fine-grained objective evaluation metric that covers speech, music, and general sound events. It accepts text, visual, and temporal span prompts and scores separation output along four dimensions: recall, precision, faithfulness, and overall quality, showing high alignment with human judgments. The metric also shows potential for data filtering, pseudo-labeling of large datasets, and reranking the outputs of audio separation models. Code and pretrained models are publicly released.

📝 Abstract
Performance evaluation remains a complex challenge in audio separation: existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. Subjective listening tests, on the other hand, remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric that shows high alignment with human perception. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), and covers four evaluation dimensions (recall, precision, faithfulness, and overall). SAJ also shows potential for data filtering, pseudo-labeling of large datasets, and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.
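To make the reranking and data-filtering use cases concrete, the following is a minimal sketch of how per-dimension scores from a reference-free judge could be consumed. Everything named here is an assumption for illustration: `JudgeScores`, `Scorer`, `rerank`, `filter_by_overall`, and `toy_scorer` are hypothetical and are not the API of the released repository, and the score values are made-up numbers on an arbitrary scale.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class JudgeScores:
    """Hypothetical container for the four dimensions named in the abstract."""
    recall: float
    precision: float
    faithfulness: float
    overall: float


# Hypothetical scorer type: maps (candidate id, prompt) to scores.
# It stands in for a judge model; it is NOT the released SAJ interface.
Scorer = Callable[[str, str], JudgeScores]


def rerank(candidates: Sequence[str], prompt: str, scorer: Scorer) -> List[str]:
    """Order separation candidates from best to worst by the overall score."""
    return sorted(candidates, key=lambda c: scorer(c, prompt).overall, reverse=True)


def filter_by_overall(
    candidates: Sequence[str], prompt: str, scorer: Scorer, min_overall: float
) -> List[str]:
    """Keep only candidates whose overall score reaches the threshold."""
    return [c for c in candidates if scorer(c, prompt).overall >= min_overall]


def toy_scorer(candidate: str, prompt: str) -> JudgeScores:
    """Stand-in scorer with fixed, made-up values (assumption, for the demo only)."""
    table = {
        "sep_a.wav": JudgeScores(0.9, 0.6, 0.7, 0.70),
        "sep_b.wav": JudgeScores(0.8, 0.9, 0.9, 0.88),
        "sep_c.wav": JudgeScores(0.4, 0.5, 0.3, 0.35),
    }
    return table[candidate]


candidates = ["sep_a.wav", "sep_b.wav", "sep_c.wav"]
ranked = rerank(candidates, "dog barking", toy_scorer)
kept = filter_by_overall(candidates, "dog barking", toy_scorer, min_overall=0.5)
print(ranked)
print(kept)
```

Because no ground-truth stem is needed, the same two functions apply to unlabeled data: reranking picks the best of several model outputs, and thresholding discards low-scoring items before they are used as pseudo-labels. Sorting on another field, such as `faithfulness`, would be an equally valid choice depending on the downstream task.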
Problem

Research questions and friction points this paper is trying to address.

audio separation
perceptual evaluation
objective metric
reference-free
multimodal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal
reference-free
perceptual evaluation
audio separation
fine-grained