Distillation and Pruning for Scalable Self-Supervised Representation-Based Speech Quality Assessment

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the high deployment overhead of non-intrusive speech quality assessment (NISQA) models built on XLS-R self-supervised representations. We propose a lightweight framework combining knowledge distillation and MOS-guided, data-driven structured pruning: a teacher model generates pseudo-labels on unlabeled degraded speech to supervise compact student models, while structured pruning reduces the parameter count under MOS-correlation constraints. To our knowledge, this is the first systematic comparison of distillation versus pruning across model scales for this task. Results show that distillation markedly narrows the MOS-correlation gap to the teacher, closing up to half of the baseline-to-teacher gap even for ultra-small students (99% parameter reduction) and outperforming equally sized pruned models. Evaluated on a large-scale MOS dataset (>100k labeled clips), our approach balances accuracy and deployability, offering an efficient route to edge-device speech quality assessment.

📝 Abstract
In this paper, we investigate distillation and pruning methods to reduce model size for non-intrusive speech quality assessment based on self-supervised representations. Our experiments build on XLS-R-SQA, a speech quality assessment model using wav2vec 2.0 XLS-R embeddings. We retrain this model on a large compilation of mean opinion score datasets, encompassing over 100,000 labeled clips. For distillation, using this model as a teacher, we generate pseudo-labels on unlabeled degraded speech signals and train student models of varying sizes. For pruning, we use a data-driven strategy. While data-driven pruning performs better at larger model sizes, distillation on unlabeled data is more effective for smaller model sizes. Distillation can halve the gap between the baseline's correlation with ground-truth MOS labels and that of the XLS-R-based teacher model, while reducing model size by two orders of magnitude compared to the teacher model.
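The distillation recipe in the abstract can be sketched minimally: a frozen teacher scores unlabeled degraded speech to produce pseudo-MOS labels, and a much smaller student regresses onto them with no human ratings. Everything below is an illustrative assumption, not the paper's actual XLS-R-SQA architecture: the "teacher" is a fixed sigmoid scorer, features are random vectors standing in for XLS-R embeddings, and the student is a linear model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in teacher: a fixed map from XLS-R-style embedding
# vectors to a MOS estimate in [1, 5]. The paper's real teacher is the
# retrained XLS-R-SQA model; the dimension here is arbitrary.
D = 64
w_teacher = rng.normal(size=D)

def teacher_mos(x):
    """Pseudo-label generator: squash a linear score into the 1-5 MOS range."""
    return 1.0 + 4.0 / (1.0 + np.exp(-(x @ w_teacher)))

# Compact linear student trained only on teacher pseudo-labels
# (no human MOS labels are needed for this stage).
w_student = np.zeros(D)
b_student = 0.0
lr = 0.01
for _ in range(500):
    x = rng.normal(size=(32, D))        # batch of unlabeled "speech features"
    pseudo = teacher_mos(x)             # teacher pseudo-MOS labels
    err = (x @ w_student + b_student) - pseudo
    w_student -= lr * 2 * x.T @ err / len(x)   # MSE gradient step on weights
    b_student -= lr * 2 * err.mean()           # ... and on the bias

# After distillation the student tracks the teacher far better than at init.
x_val = rng.normal(size=(256, D))
mse = np.mean((x_val @ w_student + b_student - teacher_mos(x_val)) ** 2)
```

The key property this illustrates is that the student's supervision signal comes entirely from the teacher, so any amount of unlabeled degraded speech can be used for training.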
Problem

Research questions and friction points this paper is trying to address.

High deployment cost of XLS-R-based non-intrusive speech quality assessment models
Lack of a systematic comparison between distillation and pruning for model size reduction
Preserving correlation with ground-truth MOS while drastically shrinking the model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation with teacher-generated pseudo-labels on unlabeled degraded speech
Data-driven structured pruning that outperforms distillation at larger model sizes
Ultra-small students (two orders of magnitude smaller) that halve the baseline-to-teacher correlation gap