🤖 AI Summary
Conventional automatic speech recognition (ASR) evaluation using word error rate (WER) fails to capture the practical impact of ASR errors on downstream large language model (LLM)-driven tasks. Method: We propose a task-oriented ASR evaluation framework that (1) systematically classifies ASR error types and analyzes their contextual reparability within LLM prompts; (2) defines a multidimensional metric integrating semantic severity of errors, LLM-based correction success rate, and end-task completion accuracy; and (3) validates the framework empirically on representative speech-to-LLM pipelines—including voice command execution and meeting summary generation. Results: Our framework significantly outperforms WER in reflecting ASR effectiveness in real-world LLM applications. It provides an interpretable, quantifiable assessment grounded in downstream task performance, enabling principled, task-aware ASR model development and optimization.
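The summary names three dimensions for the proposed metric: semantic severity of errors, LLM-based correction success rate, and end-task completion accuracy. A minimal sketch of how such a composite score might be combined is below; the function name, weights, and linear form are illustrative assumptions, not the paper's actual definition.

```python
# Hypothetical composite of the three dimensions named in the summary.
# Weights and the linear formula are assumptions for illustration only.
def task_oriented_score(semantic_severity: float,
                        correction_success: float,
                        task_completion: float,
                        weights=(0.2, 0.3, 0.5)) -> float:
    """All inputs in [0, 1]; severity acts as a penalty, the others as rewards."""
    w_sev, w_corr, w_task = weights
    return (w_sev * (1.0 - semantic_severity)   # less severe errors score higher
            + w_corr * correction_success        # errors the LLM can repair count less
            + w_task * task_completion)          # downstream task success dominates
```

In this sketch, end-task completion carries the largest weight, reflecting the summary's emphasis on grounding the assessment in downstream task performance.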
📝 Abstract
Automatic Speech Recognition (ASR) plays a crucial role in human-machine interaction and serves as an interface for a wide range of applications. Traditionally, ASR performance has been evaluated using Word Error Rate (WER), a metric that quantifies the number of insertions, deletions, and substitutions in the generated transcriptions. However, with the increasing adoption of powerful Large Language Models (LLMs) as the core processing component in various applications, the significance of different types of ASR errors in downstream tasks warrants further exploration. In this work, we analyze the capabilities of LLMs to correct errors introduced by ASR systems and propose a new measure to evaluate ASR performance for LLM-powered applications.
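The WER baseline the abstract refers to is the word-level edit distance (insertions, deletions, substitutions) normalized by the reference length. A standard dynamic-programming sketch, with a hypothetical function name:

```python
# Word Error Rate via word-level Levenshtein distance (standard definition,
# not code from the paper).
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                              # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                              # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER treats every edit equally, e.g. `wer("turn on the lights", "turn of the light")` counts two substitutions regardless of whether an LLM could trivially repair them from context, which is exactly the limitation the abstract highlights.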