SAM Audio Judge: A Unified Multimodal Framework for Perceptual Evaluation of Audio Separation

📅 2026-01-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of existing audio separation evaluation metrics, which often misalign with human perception, offer only coarse granularity, and rely on ground-truth reference signals, while subjective listening tests remain costly and hard to scale. To overcome these challenges, the authors propose SAM Audio Judge (SAJ), a reference-free, multimodal, fine-grained objective evaluation framework that unifies assessment across speech, music, and general sound events. The approach accepts text, visual, and temporal span prompts and produces perceptually aligned scores along four dimensions: recall, precision, faithfulness, and overall quality, improving correlation with human judgments. The method is also applicable to data curation, pseudo-label generation, and model re-ranking. Code and pretrained models are publicly released.
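The summary's claim about correlation with human judgments can be made concrete with a rank correlation between metric scores and listener ratings. The sketch below is a minimal, dependency-free Spearman correlation; the source does not state which correlation statistic the authors report, so the choice of Spearman is an assumption, and the function names (`average_ranks`, `pearson`, `spearman`) are hypothetical helpers, not part of the released code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes equal lengths and non-constant inputs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

A higher value means the metric orders separation outputs more like human listeners do; any inputs passed to `spearman` here would be illustrative values, not results from the paper.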

📝 Abstract

Performance evaluation remains a complex challenge in audio separation: existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. Subjective listening tests, on the other hand, remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric that shows high alignment with human perception. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four evaluation dimensions (recall, precision, faithfulness, and overall). SAM Audio Judge also shows potential for data filtering, pseudo-labeling of large datasets, and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.
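The reranking and data-filtering uses mentioned in the abstract can be sketched as follows. This is an illustrative outline only: the four dimension names come from the abstract, but the `Judge` callable signature, the `Candidate` type, and `toy_judge` are hypothetical stand-ins and do not reflect the actual interface of the released sam-audio code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Dimension names as listed in the abstract.
DIMENSIONS = ("recall", "precision", "faithfulness", "overall")

Scores = Dict[str, float]
# Hypothetical judge interface: (separated audio samples, text prompt) -> scores.
Judge = Callable[[Sequence[float], str], Scores]


@dataclass
class Candidate:
    name: str
    audio: Sequence[float]  # separated output; no reference signal is needed


def rerank(candidates: List[Candidate], prompt: str, judge: Judge,
           key: str = "overall") -> List[Tuple[Candidate, Scores]]:
    """Score every candidate once and sort best-first on one dimension."""
    scored = [(c, judge(c.audio, prompt)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1][key], reverse=True)


def filter_by_threshold(candidates: List[Candidate], prompt: str, judge: Judge,
                        min_overall: float) -> List[Candidate]:
    """Keep only candidates whose overall score reaches the threshold."""
    return [c for c in candidates
            if judge(c.audio, prompt)["overall"] >= min_overall]


def toy_judge(audio: Sequence[float], prompt: str) -> Scores:
    """Placeholder so the sketch runs; NOT the SAJ model.

    Uses mean absolute amplitude as a fake score for every dimension.
    """
    level = sum(abs(x) for x in audio) / max(len(audio), 1)
    return {d: level for d in DIMENSIONS}
```

In practice `toy_judge` would be replaced by a wrapper around the real model; because the judge is reference-free, the same two functions would apply to unlabeled data, which is what makes filtering and pseudo-labeling at scale plausible.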

Problem

Research questions and friction points this paper is trying to address.

audio separation

perceptual evaluation

objective metric

reference-free

multimodal evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal

reference-free

perceptual evaluation

audio separation

fine-grained

🔎 Similar Papers

A Reference-free Metric for Language-Queried Audio Source Separation using Contrastive Language-Audio Pretraining

2024-07-06, arXiv.org, Citations: 3

Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task

2024-09-21, arXiv.org, Citations: 0