🤖 AI Summary
Downstream evaluation of existing self-supervised speech models relies on additional fine-tuning or hyperparameter tuning, incurring high computational cost and lacking generalizability.
Method: We propose the first training-free, parameter-free, task-agnostic evaluation paradigm: large language models (LLMs) score the discrete speech token sequences generated by speech models, using only minimal domain-specific prompts, and the average log-likelihood serves as the evaluation metric. Crucially, we discover that the speech representations LLMs implicitly model during inference can be transferred directly to speaker verification.
Contribution/Results: Our metric exhibits strong correlation with ASR performance (Pearson *r* > 0.9). Moreover, LLM-implicit embeddings achieve 92% of the performance of the current state-of-the-art unsupervised baseline on speaker verification. This work establishes an efficient, general-purpose pathway for evaluating speech representations without requiring model adaptation or task-specific supervision.
📝 Abstract
Self-Supervised Learning (SSL) has gained traction for its ability to learn rich representations with low labeling costs, applicable across diverse downstream tasks. However, assessing downstream-task performance remains challenging due to the cost of extra training and evaluation. Existing methods for task-agnostic evaluation also require extra training or hyperparameter tuning. We propose a novel evaluation metric using large language models (LLMs). By feeding discrete token sequences derived from SSL models, together with minimal domain cues, into LLMs, we obtain the mean log-likelihood; these cues guide in-context learning, rendering the score more reliable without extra training or hyperparameter tuning. Experimental results show a correlation between LLM-based scores and performance on the automatic speech recognition (ASR) task. Additionally, our findings reveal that LLMs not only function as SSL evaluation tools but also provide inference-time embeddings that are useful for the speaker verification task.
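As a rough illustration of the scoring step, the sketch below computes a mean log-likelihood from per-position logits. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `log_softmax` and `mean_log_likelihood` are hypothetical, the logits are assumed to come from some causal LLM that has already been conditioned on the domain cue, and the abstract does not specify the prompt format or the model used.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_log_likelihood(
    step_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> float:
    """Average log-probability the model assigns to each observed token.

    Assumption: step_logits[t] holds the logits predicted for position t,
    given the prompt (domain cue) and all tokens before position t.
    """
    if len(step_logits) != len(token_ids) or len(token_ids) == 0:
        raise ValueError("need one logit vector per token, and at least one token")
    total = sum(
        log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)
    )
    return total / len(token_ids)


# Toy usage: uniform logits over a 4-symbol vocabulary, so each token
# has probability 1/4 and the score equals -log(4).
toy_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
toy_tokens = [1, 3]
score = mean_log_likelihood(toy_logits, toy_tokens)
```

Averaging over tokens, rather than summing, keeps scores comparable across sequences of different lengths; under this reading, a higher (less negative) score means the LLM finds the token sequence more predictable.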