AI Summary
This work addresses the challenge of reliably comparing the performance of speech generation systems. We propose a preference-score-driven pairwise speech quality assessment method: first, a deep regression model predicts an absolute Mean Opinion Score (MOS) for each of two speech samples; then, a differentiable preference function aggregates the two absolute scores into a relative preference score. Our key contribution is the first principled decoupling and joint modeling of absolute quality estimation and relative preference learning. To support this, we construct the first large-scale pairwise speech preference dataset synthesized from MOS labels, which significantly improves generalization in low-data regimes. By integrating data distillation and synthetic data augmentation, our method consistently outperforms baselines across diverse training configurations and cross-domain evaluations, achieving up to a 12.7% improvement in preference prediction accuracy. The results demonstrate strong robustness and broad applicability.
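The two-stage idea above (absolute MOS per sample, then a preference function over the pair) can be illustrated with a minimal sketch. The summary does not specify the form of the preference function, so the sigmoid of the MOS difference used here is an assumption, and the names `preference_score`, `preferred_sample`, and `temperature` are hypothetical. The MOS inputs stand in for the outputs of the regression model, which is not reproduced here.

```python
import math


def preference_score(mos_a: float, mos_b: float, temperature: float = 1.0) -> float:
    """Map two absolute MOS predictions to a preference score in (0, 1).

    Assumption: a sigmoid of the scaled MOS difference. Values above 0.5
    mean sample A is preferred, values below 0.5 mean sample B is preferred.
    """
    return 1.0 / (1.0 + math.exp(-(mos_a - mos_b) / temperature))


def preferred_sample(mos_a: float, mos_b: float) -> str:
    """Return "A" or "B" depending on which sample the score favors."""
    return "A" if preference_score(mos_a, mos_b) > 0.5 else "B"


if __name__ == "__main__":
    # Placeholder MOS values standing in for regression-model outputs.
    print(preference_score(4.2, 3.1))
    print(preferred_sample(4.2, 3.1))
```

Because this function is smooth in both inputs, a loss on the preference score can be backpropagated into the MOS predictor, which is one way a decoupled absolute/relative setup could be trained jointly.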
Abstract
One of the most effective ways to compare the performance of two speech generation systems is to estimate the preference score between their generated speech. This paper proposes a novel universal preference-score-based pairwise speech quality assessment (UPPSQA) model, which predicts the preference score between paired speech samples to determine which one has better quality. The model first predicts the absolute mean opinion score (MOS) of each of the two speech samples separately, and then aggregates the two scores into a relative preference score using a preference function. To address the scarcity of preference data, we also construct a new pairwise speech dataset from an existing MOS dataset for our experiments. Experimental results confirm that UPPSQA achieves higher prediction accuracy than the baseline models across training scenarios with different data types and label conditions, and in both in-domain and out-of-domain test scenarios, demonstrating its universality.
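As a rough illustration of how a pairwise dataset might be derived from a MOS dataset, the sketch below pairs up utterances and labels each pair by which one has the higher MOS. The abstract does not describe the actual construction procedure, so the pairing strategy, the handling of ties, and the names `build_pairs` and `min_gap` are all assumptions made for this example.

```python
from itertools import combinations


def build_pairs(mos_items, min_gap=0.0):
    """Turn (utterance_id, mos) items into (id_a, id_b, label) pairs.

    Assumptions: label is 1.0 when A has the higher MOS and 0.0 when B does;
    pairs whose MOS gap is at most `min_gap` (including exact ties) are
    skipped because their preference is ambiguous.
    """
    pairs = []
    for (id_a, mos_a), (id_b, mos_b) in combinations(mos_items, 2):
        gap = mos_a - mos_b
        if abs(gap) <= min_gap:
            continue  # ambiguous pair, no clear preference
        pairs.append((id_a, id_b, 1.0 if gap > 0 else 0.0))
    return pairs


if __name__ == "__main__":
    # Hypothetical utterance IDs and MOS labels.
    items = [("u1", 4.0), ("u2", 3.0), ("u3", 4.0)]
    for pair in build_pairs(items):
        print(pair)
```

Pairing every utterance with every other one grows quadratically with the dataset size, so a real pipeline would likely sample pairs rather than enumerate them all.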