🤖 AI Summary
Supervised speech quality predictors depend on in-domain labeled data and generalize poorly to new domains, while unsupervised approaches built on pretrained models remain poorly understood, especially in multilingual settings. This paper presents a layer-wise analysis, based on reference modeling, of how multilingual self-supervised learning (SSL) and automatic speech recognition (ASR) pretrained models encode quality attributes of synthesized speech, such as naturalness and intelligibility. Features from early SSL layers correlate with human ratings of synthesized speech, while later ASR layers can predict intelligibility and the quality of non-neural systems. Performance also depends on how well the reference data matches the speech being evaluated. The approach requires no human-annotated training data.
📝 Abstract
While supervised quality predictors for synthesized speech have demonstrated strong correlations with human ratings, their requirement for in-domain labeled training data hinders generalization to new domains. Unsupervised approaches based on pretrained self-supervised learning (SSL) models and automatic speech recognition (ASR) models are a promising alternative; however, little is known about how these models encode information about speech quality. To better understand how different aspects of speech quality are encoded in a multilingual setting, we present a layer-wise analysis of multilingual pretrained speech models based on reference modeling. We find that features extracted from early SSL layers correlate with human ratings of synthesized speech, and that later layers of ASR models can predict the quality of non-neural systems as well as intelligibility. We also demonstrate the importance of using well-matched reference data.
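The abstract does not spell out the scoring function, so the following is only a minimal sketch of what a layer-wise, reference-based analysis could look like. It assumes utterance-level embeddings have already been extracted per layer, scores each test utterance by its negative mean distance to its k nearest reference embeddings, and correlates those scores with human ratings. The function names, the Euclidean k-nearest-neighbor score, and the parameter `k` are illustrative assumptions, not the authors' method.

```python
import numpy as np


def reference_score(test_feats, ref_feats, k=5):
    """Score test utterances by closeness to reference speech (assumed metric).

    test_feats: (n_test, dim) utterance-level embeddings from one layer.
    ref_feats:  (n_ref, dim) embeddings of reference speech from the same layer.
    Returns the negative mean Euclidean distance to the k nearest references,
    so a higher score means the utterance lies closer to the reference data.
    """
    test_feats = np.asarray(test_feats, dtype=float)
    ref_feats = np.asarray(ref_feats, dtype=float)
    # Pairwise Euclidean distances, shape (n_test, n_ref).
    diffs = test_feats[:, None, :] - ref_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    k = min(k, ref_feats.shape[0])
    nearest = np.sort(dists, axis=1)[:, :k]
    return -nearest.mean(axis=1)


def spearman(a, b):
    """Spearman rank correlation; ties are not averaged in this simple version."""
    ranks_a = np.argsort(np.argsort(a)).astype(float)
    ranks_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


def layerwise_correlations(test_layers, ref_layers, ratings, k=5):
    """Correlate reference-based scores with human ratings, one value per layer.

    test_layers / ref_layers: sequences of per-layer embedding matrices.
    ratings: (n_test,) human ratings, used only for analysis, never for training.
    """
    return [
        spearman(reference_score(t, r, k=k), ratings)
        for t, r in zip(test_layers, ref_layers)
    ]


if __name__ == "__main__":
    # Synthetic stand-in data: 3 layers, 20 test and 50 reference utterances.
    rng = np.random.default_rng(0)
    test_layers = [rng.normal(size=(20, 16)) for _ in range(3)]
    ref_layers = [rng.normal(size=(50, 16)) for _ in range(3)]
    ratings = rng.uniform(1.0, 5.0, size=20)
    for i, rho in enumerate(layerwise_correlations(test_layers, ref_layers, ratings)):
        print(f"layer {i}: Spearman rho = {rho:.3f}")
```

In practice the per-layer embeddings would come from a pretrained SSL or ASR model (for example, by mean-pooling each layer's frame-level hidden states), and the reference set would be chosen to match the language and domain of the test speech, in line with the abstract's point about well-matched reference data.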