Evaluation of Automatic Speech Recognition Using Generative Large Language Models

📅 2026-04-23
🤖 AI Summary
Automatic speech recognition (ASR) is usually evaluated with word error rate (WER), which is insensitive to meaning and aligns poorly with human perception. This work examines decoder-only generative large language models (LLMs) as semantic evaluators of ASR output in three ways: selecting the better of two candidate hypotheses, computing semantic distance from generative embeddings, and classifying errors qualitatively. On the HATS dataset, the best LLMs reach 92–94% agreement with human annotators in hypothesis selection, compared with 63% for WER, and they also outperform existing semantic metrics. Embeddings from decoder-based LLMs perform comparably to those from encoder models. The approach also offers an interpretable view of ASR output quality that goes beyond surface-level token matching.
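The agreement figures above can be read as the share of hypothesis pairs on which a metric prefers the same hypothesis as the human annotators. The sketch below illustrates that computation only. The tuple layout, the tie-breaking rule, `toy_score`, and the two example pairs are assumptions made for illustration; they are not taken from the HATS dataset or from the paper's protocol.

```python
from typing import Callable, Sequence, Tuple

# (reference, hypothesis_a, hypothesis_b, human_choice), where human_choice
# is 0 for hypothesis_a and 1 for hypothesis_b. This layout is hypothetical.
Pair = Tuple[str, str, str, int]


def agreement(pairs: Sequence[Pair], score: Callable[[str, str], float]) -> float:
    """Fraction of pairs where the metric picks the same hypothesis as the human.

    `score(reference, hypothesis)` is any error-like metric (lower is better).
    Ties go to hypothesis_a, which is an arbitrary choice made for this sketch.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    hits = 0
    for reference, hyp_a, hyp_b, human_choice in pairs:
        metric_choice = 0 if score(reference, hyp_a) <= score(reference, hyp_b) else 1
        hits += int(metric_choice == human_choice)
    return hits / len(pairs)


def toy_score(reference: str, hypothesis: str) -> float:
    """Stand-in surface metric: number of words not shared by both strings."""
    return float(len(set(reference.split()) ^ set(hypothesis.split())))


# Invented examples, not real annotations.
TOY_PAIRS = [
    ("turn left now", "turn left now", "turn right now", 0),
    ("it is not safe", "it's not safe", "it is safe", 0),
]

print(agreement(TOY_PAIRS, toy_score))
```

In the second toy pair the surface metric prefers the hypothesis that drops "not", while the invented human label prefers the one that keeps the meaning. That is the kind of disagreement the agreement rate is meant to expose.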

📝 Abstract
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
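To make the point about WER concrete, the following minimal sketch computes WER as word-level edit distance divided by reference length. It shows two hypotheses that receive the same score even though only one of them changes the meaning much. The function name `wer`, the whitespace tokenization, and the example sentences are assumptions for illustration; the paper's own text normalization and scoring setup are not described here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length.

    Tokenization is plain whitespace splitting, an assumption of this sketch.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


reference = "the cat sat on the mat"
minor_error = "the cat sat on a mat"      # meaning largely preserved
major_error = "the cat sat on the hat"    # meaning changed

print(wer(reference, minor_error), wer(reference, major_error))
```

Both hypotheses contain one substitution in six reference words, so WER cannot tell them apart. A semantic metric, or an LLM asked to choose between them, would be expected to separate the two.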
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
Word Error Rate
semantic evaluation
Large Language Models
human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Large Language Models
Semantic Evaluation
Automatic Speech Recognition
Embedding-based Metrics
Human-aligned Assessment