A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional automatic speech recognition evaluation metrics, such as word error rate (WER) and character error rate (CER), fail to capture human perception of errors and neglect linguistic and semantic influences. This work proposes a novel paradigm that embeds any perception-oriented evaluation metric into the minimum edit distance (minED) framework to produce an intuitively interpretable equivalent error rate. For the first time, this approach translates human perceptual modeling into a comprehensible error rate format, enabling quantification of error severity from the perspective of human understanding. The resulting metric not only aligns closely with human judgments but also effectively identifies recognition errors that critically impact semantic comprehension.

📝 Abstract

The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation with human perception and their inability to take linguistic and semantic information into account. Embedding-based metrics that seek to approximate human perception have been proposed, but unlike WER and CER their scores remain difficult to interpret. In this article, we overcome this problem by proposing a paradigm that incorporates a chosen metric to obtain an equivalent of the error rate: the Minimum Edit Distance (minED). This approach relates transcription errors to how humans perceive them, and it also enables an original study of the severity of these errors from a human perspective.
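To make the criticism of WER and CER concrete, the following minimal sketch computes both with a standard Levenshtein edit distance. It is not the paper's minED procedure, which this summary does not specify; the function names `edit_distance`, `wer`, and `cer` and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, insertions, and deletions each cost 1.
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example sentences (not taken from the paper).
reference = "i do not like this movie"
hyp_synonym = "i do not like this film"  # one substitution, meaning preserved
hyp_negation = "i do like this movie"    # one deletion, meaning reversed

print(wer(reference, hyp_synonym))
print(wer(reference, hyp_negation))
```

Both hypotheses differ from the reference by a single word-level edit, so they receive the same WER even though only the second one reverses the meaning. This is the kind of gap between error counts and human perception that the abstract describes.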

Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition

Word Error Rate

Character Error Rate

human perception

evaluation metrics

Innovation

Methods, ideas, or system contributions that make the work stand out.

Minimum Edit Distance

interpretable metrics

automatic speech recognition