🤖 AI Summary
This study addresses the inefficiency and poor reproducibility of subjective speech quality assessment (SSQA). To this end, we introduce SHEET, an open-source toolkit for end-to-end, data-driven training and evaluation of deep neural networks that predict human-rated speech quality scores. SHEET establishes, for the first time, a unified multi-dataset (e.g., BVCC, NISQA) and multi-model framework, integrating plug-and-play pretrained self-supervised learning (SSL) models from Torch Hub and Hugging Face. Through a systematic re-evaluation of SSL-MOS, we identify superior speech SSL representations: the best-performing SSL model surpasses the original SSL-MOS on both the BVCC and NISQA benchmarks and matches state-of-the-art methods. The toolkit is publicly deployed on Hugging Face Spaces, substantially lowering the barrier to entry for SSQA research.
📝 Abstract
We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, and it focuses on data-driven deep neural network-based models trained to predict human-labeled quality scores of speech samples. SHEET provides comprehensive training and evaluation scripts, multi-dataset and multi-model support, as well as pre-trained models accessible via Torch Hub and Hugging Face Spaces. To demonstrate its capabilities, we re-evaluated SSL-MOS, a speech self-supervised learning (SSL)-based SSQA model widely used in recent scientific papers, on an extensive list of speech SSL models. Experiments were conducted on two representative SSQA datasets, BVCC and NISQA, and we identified the best-performing speech SSL model, which surpassed the original SSL-MOS implementation and performed comparably to state-of-the-art methods.
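To make the model family concrete, below is a minimal sketch of an SSL-MOS-style predictor as described in the literature: frame-level features from a speech SSL encoder are mean-pooled over time and mapped to a scalar mean opinion score by a linear head. This is an illustrative sketch, not SHEET's actual implementation; the `DummyEncoder` is a stand-in for a real pretrained SSL model (e.g., wav2vec 2.0 loaded via Torch Hub), and all class and parameter names here are hypothetical.

```python
# Hypothetical sketch of an SSL-MOS-style architecture (not SHEET's code):
# SSL frame features -> temporal mean pooling -> linear regression head.
import torch
import torch.nn as nn


class SSLMOSPredictor(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, feature_dim: int):
        super().__init__()
        self.ssl_encoder = ssl_encoder        # pretrained SSL model (frozen or fine-tuned)
        self.head = nn.Linear(feature_dim, 1)  # maps pooled features to a scalar MOS

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats = self.ssl_encoder(waveform)    # (batch, frames, feature_dim)
        pooled = feats.mean(dim=1)            # average over the time axis
        return self.head(pooled).squeeze(-1)  # (batch,) predicted quality scores


class DummyEncoder(nn.Module):
    """Stand-in for a speech SSL model: chops audio into frames and projects them."""

    def __init__(self, dim: int = 768, hop: int = 320):
        super().__init__()
        self.hop = hop
        self.proj = nn.Linear(hop, dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        usable = wav.shape[1] // self.hop * self.hop
        frames = wav[:, :usable].reshape(wav.shape[0], -1, self.hop)
        return self.proj(frames)


model = SSLMOSPredictor(DummyEncoder(), feature_dim=768)
scores = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(scores.shape)  # torch.Size([2])
```

Training such a model then reduces to regressing `scores` against human-labeled MOS values with a standard loss (e.g., L1 or clipped MSE), which is the kind of recipe a toolkit like SHEET packages end to end.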