🤖 AI Summary
Traditional automatic speech recognition evaluation metrics, such as word error rate (WER) and character error rate (CER), fail to capture human perception of errors and neglect linguistic and semantic influences. This work proposes a novel paradigm that embeds any perception-oriented evaluation metric into the minimum edit distance (minED) framework to produce an intuitively interpretable equivalent error rate. For the first time, this approach translates human perceptual modeling into a comprehensible error rate format, enabling quantification of error severity from the perspective of human understanding. The resulting metric not only aligns closely with human judgments but also effectively identifies recognition errors that critically impact semantic comprehension.
📝 Abstract
The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation with human perception and their inability to take linguistic and semantic information into account. While embedding-based metrics that seek to approximate human perception have been proposed, their scores remain difficult to interpret, unlike WER and CER. In this article, we overcome this problem by proposing a paradigm that incorporates a chosen metric into an edit-distance computation to obtain an equivalent of the error rate: a Minimum Edit Distance (minED). This approach relates transcription errors to how humans perceive them, and also enables an original study of the severity of these errors from a human perspective.
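For readers less familiar with the baselines the abstract criticizes, the following is a minimal sketch of the standard WER and CER definitions (edit distance divided by reference length). It is not the minED method itself, whose details are not given here. The function names and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: two hypotheses with one word error each.
reference = "i do not like it"
hyp_negation_lost = "i do like it"     # meaning reversed
hyp_tense_changed = "i did not like it"  # meaning largely preserved

print(wer(reference, hyp_negation_lost))
print(wer(reference, hyp_tense_changed))
```

Both hypotheses receive the same WER because each contains a single word-level edit, even though only the first reverses the meaning of the sentence. This is the kind of insensitivity to error severity that motivates perception-oriented metrics.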