🤖 AI Summary
Current self-supervised learning (SSL) for speech lacks efficient, low-resource methods for early-stage evaluation of pretraining quality.
Method: We propose a label-free, lightweight embedding quality prediction framework that replaces conventional pretraining loss with unsupervised proxies: clustering quality (e.g., silhouette score) and intrinsic rank structure (via singular value spectrum analysis) of learned embeddings.
Contribution/Results: This is the first work to jointly model clustering and rank as indicators of SSL speech pretraining quality. Our approach achieves significantly stronger correlation with downstream task performance—improving Spearman correlation by 32%—while requiring only one hour of unlabeled audio and reducing GPU computational cost by over 90%. Crucially, it eliminates dependence on labeled data and fine-tuning, enabling efficient, scalable, and practical early assessment of SSL speech models.
📝 Abstract
In Self-Supervised Learning (SSL), pre-training and evaluation are resource-intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream performance. Consequently, it is often difficult to gauge the final downstream performance in a cost-efficient manner during pre-training. In this work, we propose unsupervised, efficient methods that give insights into the quality of the pre-training of SSL speech models, namely, measuring the cluster quality and rank of the embeddings of the SSL model. Results show that measures of cluster quality and rank correlate better with downstream performance than the pre-training loss using only one hour of unlabeled audio, reducing the need for GPU hours and labeled data in SSL model evaluation.
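The two proxies can be illustrated with a minimal NumPy sketch: a silhouette-style cluster-quality score over pre-assigned embedding clusters, and an effective rank derived from the entropy of the singular value spectrum. The specific formulas below (entropy-based effective rank, mean silhouette) are assumptions for illustration; the paper's exact clustering pipeline and rank measure may differ.

```python
import numpy as np

def effective_rank(X):
    # Effective rank via the entropy of the normalized singular value
    # spectrum of the (mean-centered) embedding matrix X of shape (N, d).
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def silhouette(X, labels):
    # Mean silhouette score: (b - a) / max(a, b) per sample, where a is the
    # mean intra-cluster distance and b the mean distance to the nearest
    # other cluster. Assumes cluster labels were obtained separately
    # (e.g., by k-means on the embeddings).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean() for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy check: two well-separated clusters of 8-dim "embeddings".
rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(40, 8))
X[20:, 0] += 5.0
labels = np.array([0] * 20 + [1] * 20)
sil = silhouette(X, labels)   # near 1.0 for well-separated clusters
er = effective_rank(X)        # between 1 and d = 8
```

Both measures need only the embeddings themselves (no labels, no fine-tuning), which is what makes them usable as lightweight pre-training monitors on a small amount of unlabeled audio.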