AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences

📅 2025-07-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Objective metrics for text-to-audio (TTA) synthesis, such as mel-cepstral distortion, correlate only weakly with human perceptual judgments, while subjective evaluation remains costly and time-consuming. Method: We propose an embedding-based objective metric that maps the reference and synthesized audio to temporal embedding sequences and scores their similarity by combining the max-norm used in BERTScore, which captures local matches, with a $p$-norm that captures non-local structure, thereby adapting BERTScore to the audio domain. Contribution/Results: Experiments show that the proposed score achieves a higher Spearman correlation with mean opinion scores (MOS) than conventional metrics, offering an automated evaluation tool for TTA model development.

📝 Abstract

We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is important, but conducting it incurs monetary cost. Therefore, objective metrics such as mel-cepstral distortion are used instead, but their correlation with subjective evaluation scores is weak. Our proposed metric, AudioBERTScore, calculates the similarity between the embeddings of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm, which reflects the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method correlate more strongly with subjective evaluation values than conventional metrics do.
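
To make the max-norm versus $p$-norm distinction concrete, here is a minimal sketch of a BERTScore-style comparison of two embedding sequences. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact formula, so the clipping of negative similarities, the unnormalized $p$-norm, and the helper names `cosine`, `p_norm`, and `sequence_score` are all hypothetical. The embeddings are toy vectors standing in for the output of an audio encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def p_norm(values, p):
    """p-norm of non-negative values; p = inf gives the max-norm."""
    if math.isinf(p):
        return max(values)
    return sum(v ** p for v in values) ** (1.0 / p)


def sequence_score(ref, syn, p=math.inf):
    """Return (precision, recall, F) for two embedding sequences.

    Assumption: negative similarities are clipped to zero so that the
    norm aggregates only positive matches.
    """
    sim = [[max(0.0, cosine(r, s)) for s in syn] for r in ref]
    # Recall: aggregate each reference frame's similarities to all synthesized frames.
    recall = sum(p_norm(row, p) for row in sim) / len(sim)
    # Precision: aggregate each synthesized frame's similarities to all reference frames.
    columns = list(zip(*sim))
    precision = sum(p_norm(col, p) for col in columns) / len(columns)
    total = precision + recall
    f_score = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings", one vector per time frame (made up).
    reference = [[1.0, 0.0], [0.0, 1.0]]
    synthesized = [[1.0, 0.0], [1.0, 0.0]]
    print("max-norm:", sequence_score(reference, synthesized, p=math.inf))
    print("2-norm:  ", sequence_score(reference, synthesized, p=2))
```

With the max-norm, each frame is credited only for its single best match, as in text BERTScore. With a finite $p$, every matching frame contributes, so a sound that recurs across many frames raises the score. In this unnormalized form the $p$-norm score is not bounded by 1.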

Problem

Research questions and friction points this paper is trying to address.

Proposes objective metric for text-to-audio synthesis evaluation

Addresses weak correlation between current metrics and subjective scores

Measures audio similarity using embedding sequences and p-norm

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel objective metric AudioBERTScore for TTA

Uses embedding similarity with p-norm adaptation

Higher correlation with subjective evaluations
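
The correlation claim can be checked with a rank correlation between a metric's scores and listener ratings; the AI summary names Spearman correlation with MOS. The sketch below is a plain-Python illustration of that calculation, assuming tied values receive their average rank. The function names and the numbers in the demo are invented for illustration and are not results from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up example values, one entry per synthesized clip.
    metric_scores = [0.61, 0.74, 0.58, 0.90, 0.82]
    mos = [2.9, 3.1, 3.4, 4.5, 4.0]
    print("Spearman correlation:", spearman(metric_scores, mos))
```

A value near 1 means the metric ranks clips in nearly the same order as listeners do, which is the property the paper reports improving.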
