Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation

📅 2025-02-18
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the cross-lingual generalization capability of speech quality assessment models—specifically the CNN-based NISQA and the Transformer-based AST—without multilingual fine-tuning. Using exclusively English training data, we systematically evaluate their predictive performance across five languages (German, French, Mandarin, Swedish, Dutch) on five perceptual dimensions: coloration, discontinuity, loudness, noise, and MOS. We quantitatively characterize how language-specific phonetic properties (e.g., tone, prosody) induce model bias, revealing discontinuity as a universal challenge (RMSE > 0.45 across all languages). Results show that AST exhibits significantly greater cross-lingual stability than NISQA; predictions for Mandarin achieve the highest MOS correlation (PCC > 0.85), whereas Swedish and Dutch incur the largest errors. This study establishes a reproducible cross-lingual benchmark and provides mechanistic insights into linguistic factors affecting speech quality modeling.

📝 Abstract
Objective speech quality models aim to predict human-perceived speech quality using automated methods. However, cross-lingual generalization remains a major challenge, as Mean Opinion Scores (MOS) vary across languages due to linguistic, perceptual, and dataset-specific differences. A model trained primarily on English data may struggle to generalize to languages with different phonetic, tonal, and prosodic characteristics, leading to inconsistencies in objective assessments. This study investigates the cross-lingual performance of two speech quality models: NISQA, a CNN-based model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both models were trained exclusively on English datasets containing over 49,000 speech samples and subsequently evaluated on speech in German, French, Mandarin, Swedish, and Dutch. We analyze model performance using Pearson Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS. Our findings show that while AST achieves a more stable cross-lingual performance, both models exhibit noticeable biases. Notably, Mandarin speech quality predictions correlate highly with human MOS scores, whereas Swedish and Dutch present greater prediction challenges. Discontinuities remain difficult to model across all languages. These results highlight the need for more balanced multilingual datasets and architecture-specific adaptations to improve cross-lingual generalization.
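The two evaluation metrics named in the abstract, PCC and RMSE, can be sketched in plain Python. This is a generic illustration of the metrics, not the authors' evaluation pipeline; the sample prediction and rating values are hypothetical, on the usual 1-5 MOS scale.

```python
import math

def pcc(preds, targets):
    """Pearson Correlation Coefficient between model predictions and human scores."""
    n = len(preds)
    mp = sum(preds) / n
    mt = sum(targets) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(preds, targets))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    return cov / (sp * st)

def rmse(preds, targets):
    """Root Mean Square Error between model predictions and human scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

# Hypothetical per-sample MOS predictions vs. subjective ratings (1-5 scale)
pred = [3.2, 4.1, 2.5, 3.8, 4.4]
human = [3.0, 4.3, 2.2, 4.0, 4.5]
print(f"PCC:  {pcc(pred, human):.3f}")
print(f"RMSE: {rmse(pred, human):.3f}")
```

A high PCC with a high RMSE is possible (predictions track the ranking but are offset in scale), which is why the paper reports both metrics per language and per quality dimension.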
Problem

Research questions and friction points this paper is trying to address.

Cross-lingual speech quality estimation
CNN and Transformer performance comparison
Multilingual dataset and model adaptation needs
Innovation

Methods, ideas, or system contributions that make the work stand out.

CNN-based NISQA model
Transformer-based AST model
Cross-lingual speech quality evaluation
Wafaa Wardah
Technische Universität Berlin
Artificial Intelligence · Deep Learning · Speech Science · Signal Processing · Machine Learning
Tuğçe Melike Koçak Büyüktaş
Quality and Usability Lab, Technische Universität Berlin, Germany
Kirill Shchegelskiy
Quality and Usability Lab, Technische Universität Berlin, Germany
Sebastian Möller
Quality and Usability Lab, Technische Universität Berlin, Germany; Deutsches Forschungszentrum für Künstliche Intelligenz, Speech and Language Technologies, Berlin, Germany
Robert P. Spang
Quality and Usability Lab, Technische Universität Berlin, Germany