SA-SSL-MOS: Self-supervised Learning MOS Prediction with Spectral Augmentation for Generalized Multi-Rate Speech Assessment

📅 2026-02-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Self-supervised learning (SSL) models used for speech quality assessment are typically pretrained on 16 kHz speech, so they discard the high-frequency content of audio sampled at higher rates such as 48 kHz, which limits the accuracy of mean opinion score (MOS) prediction for such audio. To overcome this, the authors propose a spectrogram-augmented SSL method in which a parallel branch explicitly models the high-frequency features that the SSL representation lacks. Training proceeds in two stages: pretraining on a large 48 kHz dataset, followed by fine-tuning on a smaller multi-rate dataset. The authors report that combining high-frequency cues with SSL-derived representations improves accuracy, and that the two-stage training improves generalization when multi-rate data is limited.
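The bandwidth argument above can be checked with simple Nyquist arithmetic. The sketch below is illustrative only: it assumes ideal downsampling to the 16 kHz SSL rate, and the helper name `unseen_band_hz` and the intermediate rates (24 and 32 kHz) are hypothetical choices within the 16-48 kHz range, not taken from the paper.

```python
SSL_RATE_HZ = 16_000  # sampling rate the SSL models are pretrained on


def unseen_band_hz(sample_rate_hz, ssl_rate_hz=SSL_RATE_HZ):
    """Return the (low, high) band in Hz that is lost when audio sampled at
    sample_rate_hz is reduced to ssl_rate_hz, or None if nothing is lost.

    Assumes ideal resampling: content above the target Nyquist limit
    (half the sampling rate) cannot be represented.
    """
    nyquist = sample_rate_hz / 2
    ssl_nyquist = ssl_rate_hz / 2
    if nyquist <= ssl_nyquist:
        return None
    return (ssl_nyquist, nyquist)


if __name__ == "__main__":
    for rate in (16_000, 24_000, 32_000, 48_000):
        print(rate, unseen_band_hz(rate))
```

For 48 kHz input, the band between 8 kHz and 24 kHz is invisible to a 16 kHz SSL front end, which is the information the parallel branch is meant to recover.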

📝 Abstract

Designing a speech quality assessment (SQA) system that estimates the mean opinion score (MOS) of multi-rate speech, with sampling frequencies ranging from 16 to 48 kHz, is a challenging task. The challenge arises from the limited availability of MOS-labeled training datasets that contain multi-rate speech samples. While self-supervised learning (SSL) models have been widely adopted in SQA to boost performance, a key limitation is that they are pretrained on 16 kHz speech and therefore discard the high-frequency information present at higher sampling rates. To address this issue, we propose a spectrogram-augmented SSL method that incorporates high-frequency features (up to a 48 kHz sampling rate) through a parallel-branch architecture. We further introduce a two-step training scheme: the model is first pretrained on a large 48 kHz dataset and then fine-tuned on a smaller multi-rate dataset. Experimental results show that leveraging the high-frequency information overlooked by SSL features is crucial for accurate multi-rate SQA, and that the proposed two-step training substantially improves generalization when multi-rate data is limited.
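The parallel-branch idea in the abstract can be sketched in a few lines. This is a toy illustration under loud assumptions, not the authors' architecture: the high-frequency branch here is a hand-written band-energy feature computed with a naive DFT, the SSL embedding is a stand-in vector rather than the output of a real pretrained model, and fusion is plain concatenation followed by a linear head. The function names, the number of bands, and the clipping to the 1-5 MOS scale are all hypothetical.

```python
import cmath
import math

SSL_RATE_HZ = 16_000  # SSL pretraining rate stated in the abstract
LOG_FLOOR = 1e-10     # avoids log(0) for empty or silent bands


def high_band_log_energy(frame, sample_rate_hz, n_bands=4):
    """Hypothetical high-frequency branch: log energy in n_bands equal-width
    bands between the SSL Nyquist limit (8 kHz) and the input Nyquist limit."""
    n = len(frame)
    low = SSL_RATE_HZ / 2
    high = sample_rate_hz / 2
    if high <= low:
        # 16 kHz input carries nothing above 8 kHz, so emit the floor value.
        return [math.log(LOG_FLOOR)] * n_bands
    # Naive DFT power spectrum over the non-negative frequency bins.
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        power.append(abs(coeff) ** 2 / n)
    width = (high - low) / n_bands
    feats = []
    for b in range(n_bands):
        lo_hz = low + b * width
        hi_hz = low + (b + 1) * width
        energy = sum(p for k, p in enumerate(power)
                     if lo_hz <= k * sample_rate_hz / n < hi_hz)
        feats.append(math.log(energy + LOG_FLOOR))
    return feats


def predict_mos(ssl_embedding, hf_features, weights, bias):
    """Fuse both branches by concatenation and apply a linear head.
    The output is clipped to the 1-5 MOS scale (an assumption of this sketch)."""
    fused = list(ssl_embedding) + list(hf_features)
    if len(weights) != len(fused):
        raise ValueError("weights must match the fused feature length")
    score = bias + sum(w * f for w, f in zip(weights, fused))
    return min(5.0, max(1.0, score))


if __name__ == "__main__":
    # A 12 kHz tone sampled at 48 kHz: invisible to a 16 kHz SSL model,
    # but it shows up in the high-frequency branch.
    tone = [math.sin(2 * math.pi * 12_000 * t / 48_000) for t in range(96)]
    print(high_band_log_energy(tone, 48_000))
```

In a real system both branches would be learned networks and the head would be trained in the two steps described above, first on the large 48 kHz dataset and then on the smaller multi-rate dataset; the sketch only shows how the two feature streams meet.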

Problem

Research questions and friction points this paper is trying to address.

multi-rate speech

MOS prediction

speech quality assessment

high-frequency information

limited labeled data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning

Spectral augmentation

Multi-rate speech assessment