AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio embedding Sequences

📅 2025-07-01
🤖 AI Summary
Problem: Objective evaluation metrics for text-to-audio (TTA) synthesis, such as mel-cepstral distortion, correlate only weakly with human perceptual judgments, while subjective evaluation remains costly and time-consuming. Method: We propose an embedding-based objective metric that maps reference and synthesized audio into temporal embedding sequences, then measures both local and non-local similarity between them by combining the max-norm used in BERTScore with a p-norm, thereby adapting BERTScore to the audio domain. Contribution/Results: Experiments across multiple TTA benchmarks show that our method achieves significantly higher Spearman correlation with mean opinion scores (MOS) than traditional metrics (e.g., STOI, PESQ). It improves both reliability and cross-dataset generalizability, offering an efficient, interpretable, and fully automated evaluation tool for TTA model development.

📝 Abstract
We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is important, but carrying it out incurs monetary costs. Therefore, objective metrics such as mel-cepstral distortion are used, but the correlation between these objective metrics and subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between the embeddings of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.
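The idea of scoring two embedding sequences with a max-norm or a $p$-norm can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the clipping of negative similarities, and the use of a power mean as the $p$-norm variant are all assumptions made for illustration, and the embeddings would in practice come from a pretrained audio encoder that is not shown here.

```python
import numpy as np


def cosine_similarity_matrix(ref, syn, eps=1e-12):
    """Pairwise cosine similarity between frames of two embedding sequences.

    ref: array of shape (T_ref, D), syn: array of shape (T_syn, D).
    """
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    syn_n = syn / (np.linalg.norm(syn, axis=1, keepdims=True) + eps)
    return ref_n @ syn_n.T  # shape: (T_ref, T_syn)


def aggregate(sim, axis, p=None):
    """Collapse one axis of the similarity matrix.

    p=None: max-norm, i.e. BERTScore-style greedy matching to the best frame.
    p>=1:   power mean of similarities clipped at zero (an assumed p-norm
            variant); every frame contributes, not only the best match.
    """
    if p is None:
        return sim.max(axis=axis)
    clipped = np.clip(sim, 0.0, None)
    return np.mean(clipped ** p, axis=axis) ** (1.0 / p)


def embedding_score(ref, syn, p=None, eps=1e-12):
    """Return (precision, recall, F1) for a synthesized vs. reference sequence."""
    sim = cosine_similarity_matrix(ref, syn)
    recall = aggregate(sim, axis=1, p=p).mean()     # each reference frame vs. syn
    precision = aggregate(sim, axis=0, p=p).mean()  # each synthesized frame vs. ref
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    reference = rng.normal(size=(50, 16))    # placeholder embeddings, not real audio
    synthesized = rng.normal(size=(40, 16))  # sequences may differ in length
    print("max-norm:", embedding_score(reference, synthesized))
    print("p=2     :", embedding_score(reference, synthesized, p=2))
```

With the max-norm, each frame is scored only by its single best-matching frame in the other sequence, whereas the power-mean variant lets all frames contribute, which is one way to read the abstract's remark about the non-local nature of environmental sounds.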
Problem

Research questions and friction points this paper is trying to address.

Proposes objective metric for text-to-audio synthesis evaluation
Addresses weak correlation between current metrics and subjective scores
Measures audio similarity using embedding sequences and p-norm
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel objective metric AudioBERTScore for TTA
Uses embedding similarity with p-norm adaptation
Higher correlation with subjective evaluations
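How well an objective metric tracks subjective evaluation is commonly checked with a rank correlation. The sketch below computes Spearman's coefficient in plain Python, assuming (as the AI Summary states) that this is the correlation of interest. The helper names are hypothetical and the score lists are made-up illustrative numbers, not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Illustrative values only: one objective score and one MOS per audio sample.
    metric_scores = [0.91, 0.62, 0.78, 0.55, 0.83]
    mos = [4.5, 2.8, 3.9, 3.0, 4.1]
    print("Spearman correlation:", spearman(metric_scores, mos))
```

A value near 1 means the metric orders the samples almost the same way listeners do, which is the property the bullet above refers to.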
Minoru Kishi
Keio University, Japan
Ryosuke Sakai
Keio University, Japan
Shinnosuke Takamichi
Keio University
Speech synthesis
Yusuke Kanamori
The University of Tokyo, Japan
Yuki Okamoto
The University of Tokyo, Japan