Evaluating Self-Supervised Speech Models via Text-Based LLMs

📅 2025-10-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Downstream evaluation of self-supervised speech models currently relies on additional fine-tuning or hyperparameter tuning, which is computationally expensive and generalizes poorly across tasks. Method: We propose the first training-free, task-agnostic evaluation paradigm that requires no hyperparameter tuning. A large language model (LLM) scores the discrete speech token sequences produced by a speech model, given only a minimal domain-specific prompt, and the average log-likelihood of the sequence serves as the evaluation metric. We also find that the speech representations the LLM forms implicitly during inference can be transferred directly to speaker verification. Contribution/Results: The metric correlates strongly with ASR performance (Pearson *r* > 0.9). Embeddings extracted from the LLM at inference time reach 92% of the performance of the current state-of-the-art unsupervised baseline on speaker verification. This work establishes an efficient, general-purpose way to evaluate speech representations without model adaptation or task-specific supervision.
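
The average log-likelihood metric mentioned above can be illustrated with a minimal sketch. It assumes the LLM has already produced one logit vector per scored position; the function names and toy values are hypothetical and do not come from the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def mean_log_likelihood(step_logits, token_ids):
    """Average log-probability of the observed tokens.

    step_logits[t] is assumed to hold the LLM's logits for position t
    (conditioned on the prompt and all earlier tokens); token_ids[t] is
    the token that was actually observed at that position.
    """
    if len(step_logits) != len(token_ids) or not token_ids:
        raise ValueError("need one logit vector per token and at least one token")
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        total += log_softmax(logits)[tok]
    return total / len(token_ids)

# Toy input (not real model output): 3-symbol vocabulary, 2 scored tokens.
toy_logits = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
toy_tokens = [1, 0]
score = mean_log_likelihood(toy_logits, toy_tokens)
```

A higher (less negative) score means the LLM finds the token sequence more predictable. With a real model, the logits would come from a single forward pass over the prompt followed by the token sequence.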

📝 Abstract

Self-Supervised Learning (SSL) has gained traction for its ability to learn rich representations at low labeling cost, and these representations apply across diverse downstream tasks. However, assessing downstream-task performance remains challenging because of the cost of extra training and evaluation. Existing methods for task-agnostic evaluation also require extra training or hyperparameter tuning. We propose a novel evaluation metric that uses large language models (LLMs). We feed discrete token sequences derived from SSL models, together with minimal domain cues, into an LLM and take the mean log-likelihood as the score; the cues guide in-context learning and make the score more reliable without extra training or hyperparameter tuning. Experimental results show a correlation between the LLM-based scores and performance on the automatic speech recognition (ASR) task. Additionally, our findings reveal that LLMs not only function as SSL evaluation tools but also provide inference-time embeddings that are useful for the speaker verification task.
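
To make the reported correlation concrete, the following sketch computes a Pearson coefficient between two equal-length lists. In practice one list would hold the LLM-based score for each SSL model and the other an ASR metric for the same models; the numbers below are synthetic placeholders chosen to be perfectly linear and are not results from the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Synthetic placeholders: one entry per (hypothetical) SSL model.
placeholder_scores = [1.0, 2.0, 3.0, 4.0]
placeholder_metric = [2.0, 4.0, 6.0, 8.0]
r_up = pearson_r(placeholder_scores, placeholder_metric)
r_down = pearson_r(placeholder_scores, placeholder_metric[::-1])
```

If the ASR metric is an error rate such as WER, a useful evaluation score would be expected to correlate negatively with it, so the sign convention should be stated alongside any reported coefficient.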

Problem

Research questions and friction points this paper is trying to address.

Evaluating self-supervised speech models without extra training

Using LLMs to assess SSL model performance via likelihood

Providing inference-time embeddings for speaker verification tasks
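
One way the inference-time embeddings could be used for speaker verification is sketched below. The page does not say how embeddings are pooled or compared, so mean pooling over positions, cosine scoring, and the threshold value are all assumptions, and every name here is hypothetical.

```python
import math

def mean_pool(frames):
    """Average a list of equal-length vectors into one utterance embedding."""
    if not frames:
        raise ValueError("need at least one vector")
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.5):
    """Accept the trial if similarity reaches the (assumed) threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Toy 2-dimensional "hidden states" standing in for LLM activations.
utt_a = mean_pool([[1.0, 0.0], [3.0, 0.0]])   # pooled to [2.0, 0.0]
utt_b = mean_pool([[0.0, 2.0], [0.0, 4.0]])   # pooled to [0.0, 3.0]
utt_c = mean_pool([[4.0, 0.0]])               # pooled to [4.0, 0.0]
```

In a real system the threshold would be tuned on held-out trials, for example at the equal error rate operating point.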

Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLMs to evaluate SSL models without training

Inputting discrete tokens and domain cues into LLMs

LLMs provide inference-time embeddings for speaker verification
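
The idea of feeding discrete tokens plus a domain cue to a text LLM can be sketched as a simple prompt formatter. The exact text rendering of tokens and the wording of the cue are not given on this page, so the space-separated integer format, the example cue, and the optional collapsing of repeated units (a common convention in unit-based speech pipelines) are assumptions for illustration only.

```python
from itertools import groupby

def format_prompt(unit_ids, domain_cue, collapse_repeats=False):
    """Render discrete unit IDs as text preceded by a short domain cue.

    unit_ids: integer cluster/codebook indices from an SSL speech model.
    domain_cue: short text telling the LLM what the sequence represents.
    collapse_repeats: if True, merge runs of identical consecutive units.
    """
    units = list(unit_ids)
    if collapse_repeats:
        units = [key for key, _ in groupby(units)]
    body = " ".join(str(u) for u in units)
    return f"{domain_cue}\n{body}"

# Hypothetical cue and toy unit sequence.
cue = "The following is a sequence of discrete speech units."
prompt_raw = format_prompt([5, 5, 12, 7, 7, 7], cue)
prompt_collapsed = format_prompt([5, 5, 12, 7, 7, 7], cue, collapse_repeats=True)
```

Only the positions holding unit tokens would normally contribute to the mean log-likelihood, with the cue serving purely as conditioning context.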
