🤖 AI Summary
This work addresses the challenges of catastrophic forgetting and overfitting commonly encountered when fine-tuning self-supervised learning (SSL) models for mean opinion score (MOS) prediction. To mitigate these issues, the authors propose an inter-layer self-distillation mechanism that clusters hidden representations from multiple layers of the SSL model to generate token IDs, which serve as auxiliary supervision signals during fine-tuning. This approach establishes layer-wise self-distillation targets within a multi-task learning framework, effectively preserving pre-trained knowledge while enhancing knowledge transfer and generalization. Experimental results demonstrate that the proposed method significantly outperforms standard fine-tuning strategies in both in-domain and out-of-domain MOS prediction tasks, achieving improved accuracy and robustness.
📝 Abstract
With the advancement of self-supervised learning (SSL), fine-tuning pretrained SSL models for mean opinion score (MOS) prediction has achieved state-of-the-art performance. However, during fine-tuning, these SSL-based MOS prediction models often suffer from catastrophic forgetting of the pretrained knowledge and tend to overfit the training set, resulting in poor generalization performance. In this study, we propose DistilMOS, a novel method that learns to predict not only MOS but also token IDs obtained by clustering the hidden representations of each layer in the pretrained SSL model. These layer-wise token targets serve as self-distillation signals that enable the MOS prediction model to extract rich internal knowledge from SSL models, enhancing both prediction accuracy and generalization capability. Experimental evaluations demonstrate that our method significantly outperforms standard SSL-based MOS prediction models on both in-domain and out-of-domain evaluations, verifying the effectiveness and practicality of the proposed method.
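The core idea above — cluster each layer's hidden representations into token IDs, then train with a joint MOS-regression plus per-layer token-prediction loss — can be sketched as follows. This is a minimal NumPy illustration under assumed simplifications (a tiny hand-rolled k-means instead of a production clusterer, and hypothetical function names such as `layer_token_targets` and `multitask_loss`), not the authors' actual implementation.

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means: returns centroids and per-sample cluster (token) IDs.
    A stand-in for whatever clustering the paper actually uses."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        ids = dists.argmin(axis=1)
        for j in range(k):
            members = X[ids == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, ids

def layer_token_targets(hidden_states, k=4):
    """Offline step (assumed): cluster each layer's frame-level hidden
    representations from the pretrained SSL model into discrete token IDs."""
    return [kmeans(h.reshape(-1, h.shape[-1]), k)[1] for h in hidden_states]

def multitask_loss(mos_pred, mos_true, token_logits, token_ids, alpha=0.5):
    """Fine-tuning objective (assumed form): MOS regression (MSE) plus the
    average per-layer token-classification cross-entropy, weighted by alpha."""
    mse = np.mean((mos_pred - mos_true) ** 2)
    ce = 0.0
    for logits, ids in zip(token_logits, token_ids):
        # log-softmax over the token vocabulary, then pick the target tokens
        logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
        ce += -logp[np.arange(len(ids)), ids].mean()
    return mse + alpha * ce / len(token_ids)
```

The layer-wise cross-entropy terms play the role of the self-distillation signal: they penalize the fine-tuned model for drifting away from the pretrained model's internal representations, while the MSE term fits the MOS labels.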