🤖 AI Summary
Problem: Conventional objective metrics for text-to-audio (TTA) synthesis, such as mel-cepstral distortion, correlate only weakly with human perceptual judgments, while subjective evaluation remains costly and time-consuming.
Method: We propose a novel embedding-based objective metric that maps reference and synthesized audio into embedding sequences, then measures their similarity using both the max-norm of conventional BERTScore and a $p$-norm that reflects the non-local nature of environmental sounds, thereby adapting BERTScore to the audio domain.
Contribution/Results: Extensive experiments across multiple TTA benchmarks demonstrate that our method achieves significantly higher Spearman correlation with mean opinion scores (MOS) than traditional metrics (e.g., STOI, PESQ). It improves both reliability and cross-dataset generalizability, offering an efficient, interpretable, and fully automated evaluation tool for TTA model development.
📝 Abstract
We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is important, but conducting it incurs monetary costs. Objective metrics such as mel-cepstral distortion are therefore used instead, but their correlation with subjective evaluation scores is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between embeddings of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm, which reflects the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method correlate more strongly with subjective evaluation scores than conventional metrics do.
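To make the max-norm versus $p$-norm distinction concrete, the following is a minimal sketch of a BERTScore-style similarity between two embedding sequences. It is not the authors' implementation: the function names, the use of cosine similarity, the clipping of negative similarities, and the normalized power-mean form of the $p$-norm are all assumptions made for illustration, and the embeddings are assumed to come from some pretrained audio encoder that is not shown here.

```python
import numpy as np


def cosine_similarity_matrix(ref, syn, eps=1e-12):
    """Pairwise cosine similarity between frames; result has shape (T_ref, T_syn)."""
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    syn_n = syn / (np.linalg.norm(syn, axis=1, keepdims=True) + eps)
    return ref_n @ syn_n.T


def aggregate(sim, axis, p=None):
    """Collapse one axis of the similarity matrix.

    p=None -> max-norm: keep only the best-matching frame (BERTScore-style).
    p>0    -> normalized p-norm (power mean): every frame contributes, which is
              one hypothetical way to capture non-local matches. Negative
              similarities are clipped to zero (an assumption of this sketch).
    """
    if p is None:
        return sim.max(axis=axis)
    clipped = np.clip(sim, 0.0, None)
    return np.mean(clipped ** p, axis=axis) ** (1.0 / p)


def embedding_score(ref, syn, p=None):
    """Return (precision, recall, f1) for embeddings of shape (T, D)."""
    sim = cosine_similarity_matrix(ref, syn)
    # Recall: how well each reference frame is matched by the synthesized audio.
    recall = float(aggregate(sim, axis=1, p=p).mean())
    # Precision: how well each synthesized frame is matched by the reference.
    precision = float(aggregate(sim, axis=0, p=p).mean())
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(size=(50, 16))               # stand-in reference embeddings
    syn = ref + 0.1 * rng.normal(size=(50, 16))   # stand-in synthesized embeddings
    print("max-norm:", embedding_score(ref, syn))
    print("p-norm  :", embedding_score(ref, syn, p=4.0))
```

In this sketch, a large `p` makes the power mean approach the max-norm, while a small `p` spreads the score over many frames; how the two are combined in AudioBERTScore itself is not specified in the abstract.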