Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

📅 2025-02-07
🤖 AI Summary
This work addresses automated assessment of audio aesthetic quality by proposing a unified, no-reference scoring framework that spans speech, music, and environmental sound. Methodologically, it introduces an annotation scheme that disentangles human listening perception into four distinct axes, and builds a no-reference, per-item prediction model on self-supervised audio representations with multi-task regression heads, enabling fine-grained, interpretable aesthetic quantification. Evaluated across diverse audio domains, the predicted scores correlate strongly with human Mean Opinion Scores (MOS), reportedly reaching Spearman's ρ > 0.89 and outperforming existing approaches. To foster reproducibility and downstream applications, the project open-sources the trained models, annotated dataset, and implementation code, providing a practical tool for audio data filtering, pseudo-label generation, and evaluation of generative audio models.
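The reported Spearman's ρ measures rank agreement between model scores and human MOS. As a minimal illustration of the metric itself (not the paper's evaluation code), ρ can be computed as the Pearson correlation of average ranks:

```python
def rank_with_ties(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(predicted, mos):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = rank_with_ties(predicted), rank_with_ties(mos)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotone relation between predictions and MOS gives ρ = 1.0, so ρ > 0.89 indicates near-monotone agreement with human listeners.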

📝 Abstract
The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced assessment of audio quality. Our models are evaluated against human mean opinion scores (MOS) and existing methods, demonstrating comparable or superior performance. This research not only advances the field of audio aesthetics but also provides open-source models and datasets to facilitate future work and benchmarking. We release our code and pre-trained model at: https://github.com/facebookresearch/audiobox-aesthetics
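For context on the MOS targets described above: each clip is rated by multiple listeners on each of the paper's four axes (Production Quality, Production Complexity, Content Enjoyment, Content Usefulness), and the per-axis MOS is the mean listener rating. A minimal sketch; the axis abbreviations and rating layout here are illustrative, not the released dataset's schema:

```python
from statistics import mean

# The paper's four aesthetic axes: Production Quality (PQ), Production
# Complexity (PC), Content Enjoyment (CE), Content Usefulness (CU).
AXES = ("PQ", "PC", "CE", "CU")


def mos_per_axis(ratings):
    """Collapse per-listener ratings into one MOS per axis for one clip.

    `ratings` maps a listener id to a dict of per-axis scores.
    """
    return {axis: mean(r[axis] for r in ratings.values()) for axis in AXES}
```

A trained per-item model then regresses these four MOS values directly from the waveform, with no reference recording required.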
Problem

Research questions and friction points this paper is trying to address.

Automated prediction of audio aesthetics
Reduces reliance on human evaluation
Enhances quality assessment in audio processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops no-reference audio quality models
Proposes new annotation guidelines for aesthetics
Releases open-source models and datasets
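One downstream use named above is data filtering: keeping only clips whose predicted aesthetic score clears a threshold. A hedged sketch, where `score_fn` is a hypothetical stand-in for the released model and is assumed to return a dict of per-axis scores:

```python
def filter_by_quality(clips, score_fn, axis="PQ", threshold=7.0):
    """Keep clips whose predicted score on `axis` meets `threshold`.

    `score_fn(clip)` stands in for the aesthetic model and must return
    a dict of per-axis scores; `axis` and `threshold` are illustrative.
    """
    return [clip for clip in clips if score_fn(clip)[axis] >= threshold]
```

The same predicted scores can serve as pseudo-labels when conditioning or curating large training sets.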