🤖 AI Summary
Conventional ASR evaluation metrics such as Word Error Rate (WER) over-penalize morphosyntactic variations that preserve semantic meaning, which is particularly problematic for morphologically rich, word-order-flexible Indian languages. Method: We propose LASER, the first multilingual, semantics-aware ASR scoring framework that leverages large language model (LLM) in-context learning. LASER combines zero-shot scoring with Gemini 2.5 Pro and a fine-tuned Llama 3 model, trained on word-pair-level data, that classifies which type of error penalty to apply; Hindi-based prompts generalize across Indian languages, enabling lightweight deployment. Contribution/Results: LASER achieves 94% correlation with human judgments on Hindi and 89% accuracy in error-type classification. This work pioneers the systematic integration of LLM in-context learning into semantic ASR evaluation, improving fairness and fine-grained error analysis, especially for low-resource languages.
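To make the motivation concrete, here is a minimal WER computation (word-level Levenshtein distance) showing how a surface-level metric penalizes a meaning-preserving word-order change. The Hindi example sentences are illustrative only, not taken from the paper.

```python
# Minimal WER: Levenshtein edit distance over words, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# Same meaning ("I had gone to the market yesterday"), two valid word orders:
ref = "main kal bazaar gaya tha"
hyp = "kal main bazaar gaya tha"
print(f"WER = {wer(ref, hyp):.2f}")  # swapping two words counts as 2 edits -> WER = 0.40
```

Despite identical semantics, WER charges two substitutions, which is the kind of penalty a semantics-aware score like LASER is designed to avoid.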
📝 Abstract
Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic variations that do not significantly alter sentence semantics. We introduce LASER, an LLM-based scoring rubric that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores computed with Gemini 2.5 Pro achieved a very high correlation of 94% with human annotations. Hindi examples in the prompt were also effective for analyzing errors in other Indian languages such as Marathi, Kannada, and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be fine-tuned on word-pair examples derived from reference transcripts and ASR predictions to predict which kind of penalty should be applied, with close to 89% accuracy.
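The word-pair examples used to fine-tune the smaller LLM can be derived by aligning the reference against the ASR hypothesis and extracting substituted word pairs. Below is a hypothetical sketch of that extraction step using `difflib`; the alignment method and the Hindi example are assumptions for illustration, and the actual penalty labels would come from the LASER rubric.

```python
# Sketch: derive (reference_word, hypothesis_word) substitution pairs from an
# aligned reference/ASR-hypothesis sentence pair. Each pair would then be
# labeled with a penalty type and used as classifier training data.
from difflib import SequenceMatcher

def word_pairs(reference: str, hypothesis: str) -> list[tuple[str, str]]:
    ref, hyp = reference.split(), hypothesis.split()
    pairs = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if tag == "replace":
            # Pair up substituted words position by position.
            pairs.extend(zip(ref[i1:i2], hyp[j1:j2]))
    return pairs

# Illustrative morphological substitutions (singular vs. plural inflection):
print(word_pairs("ladka school gaya", "ladke school gaye"))
# [('ladka', 'ladke'), ('gaya', 'gaye')]
```

Pairs like `('gaya', 'gaye')` differ only in inflection, so a trained classifier can assign them a lighter penalty than a substitution that changes the meaning.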