Layer-wise Analysis for Quality of Multilingual Synthesized Speech

📅 2025-09-05

🤖 AI Summary

Supervised speech quality predictors depend on in-domain annotated data and generalize poorly to new domains, while unsupervised approaches built on pretrained models remain poorly understood, especially in multilingual settings. This paper presents a layer-wise analysis, based on reference modeling, of how multilingual self-supervised learning (SSL) and automatic speech recognition (ASR) pretrained models encode different aspects of synthesized speech quality. Features from early SSL layers correlate with human ratings of synthesized speech, while later ASR layers can predict intelligibility as well as the quality of non-neural systems. The choice of reference data also matters: well-matched references are important for performance. The analysis is unsupervised and requires no labeled training data.

📝 Abstract

While supervised quality predictors for synthesized speech have demonstrated strong correlations with human ratings, their requirement for in-domain labeled training data hinders their generalization to new domains. Unsupervised approaches based on pretrained self-supervised learning (SSL) models and automatic speech recognition (ASR) models are a promising alternative; however, little is known about how these models encode information about speech quality. Towards the goal of better understanding how different aspects of speech quality are encoded in a multilingual setting, we present a layer-wise analysis of multilingual pretrained speech models based on reference modeling. We find that features extracted from early SSL layers show correlations with human ratings of synthesized speech, and that later layers of ASR models can predict the quality of non-neural systems as well as intelligibility. We also demonstrate the importance of using well-matched reference data.
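
The abstract does not spell out the scoring procedure, so the following is only an illustrative sketch of what a reference-based, layer-wise analysis could look like, not the authors' implementation. It assumes each utterance has already been reduced to one pooled embedding per layer, scores a synthesized utterance by its negative mean Euclidean distance to the k nearest reference embeddings, and reports a Spearman correlation with human ratings for each layer. The function names, the distance metric, the value of k, and the toy data are all hypothetical.

```python
import math
from statistics import mean


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reference_score(test_vec, reference_vecs, k=1):
    """Negative mean distance to the k nearest reference embeddings.

    Higher means closer to the reference (natural) speech.
    Assumption: this scoring rule is illustrative, not taken from the paper.
    """
    dists = sorted(euclidean(test_vec, r) for r in reference_vecs)
    return -mean(dists[:k])


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def layerwise_correlation(test_by_layer, ref_by_layer, ratings, k=1):
    """Return one Spearman correlation per layer.

    test_by_layer[l][i] is the pooled embedding of utterance i at layer l;
    ref_by_layer[l] holds the reference embeddings for the same layer.
    """
    correlations = []
    for test_vecs, ref_vecs in zip(test_by_layer, ref_by_layer):
        scores = [reference_score(v, ref_vecs, k) for v in test_vecs]
        correlations.append(spearman(scores, ratings))
    return correlations


# Toy data (made up): two layers, three synthesized utterances, 2-D embeddings.
ref_by_layer = [
    [[0.0, 0.0], [1.0, 0.0]],  # layer 0 references
    [[0.0, 0.0]],              # layer 1 references
]
test_by_layer = [
    [[0.1, 0.0], [2.0, 0.0], [4.0, 0.0]],  # layer 0
    [[3.0, 0.0], [1.0, 0.0], [2.0, 0.0]],  # layer 1
]
ratings = [4.5, 3.0, 1.5]  # hypothetical human ratings

correlations = layerwise_correlation(test_by_layer, ref_by_layer, ratings)
best_layer = max(range(len(correlations)), key=lambda l: correlations[l])
```

In this toy setup the layer-0 scores rank the utterances in the same order as the ratings while the layer-1 scores do not, so `best_layer` is 0. With a real model, the per-layer embeddings would come from the hidden states of an SSL or ASR encoder, and the reference set would be drawn from natural speech matched to the test domain.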

Problem

Research questions and friction points this paper is trying to address.

Analyzing how multilingual speech models encode quality information

Understanding layer-wise correlations between SSL/ASR features and human ratings

Investigating quality prediction for synthesized speech across languages

Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise analysis of multilingual pretrained speech models

Early SSL layers correlate with human ratings of synthesized speech

Later ASR layers predict intelligibility and the quality of non-neural systems
