WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study shows that conventional Word Error Rate (WER) fails to capture the clinical impact of ASR transcription errors, motivating a clinical-risk-oriented evaluation paradigm. The authors build the first clinical-expert-curated benchmark annotating the clinical impact of ASR errors, and apply GEPA, a programmatic prompt-optimization method, to tune an LLM-as-a-Judge built on Gemini-2.5-Pro for automated, scalable assessment of clinical risk. The optimized evaluator achieves 90% accuracy and Cohen's κ = 0.816, matching human expert agreement. Key contributions: (1) the first clinically grounded ASR evaluation standard, addressing WER's blindness to clinical meaning; (2) a systematic, reproducible prompt-optimization pipeline built on GEPA; and (3) empirical evidence that LLM judges can discriminate clinical risk with high fidelity, yielding a new benchmark and a practical toolset for trustworthy clinical speech technologies.
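
To make the reported agreement figures concrete, here is a minimal sketch of scoring a judge's labels against expert annotations with accuracy and Cohen's κ. The label names and toy data are illustrative, not drawn from the paper's benchmark.

```python
# Validate an LLM judge's labels against expert annotations using accuracy and
# Cohen's kappa (the paper reports 90% accuracy and kappa = 0.816).
from collections import Counter

LABELS = ["No Impact", "Minimal Impact", "Significant Impact"]

def accuracy(expert: list[str], judge: list[str]) -> float:
    """Fraction of utterances where the judge matches the expert label."""
    return sum(e == j for e, j in zip(expert, judge)) / len(expert)

def cohens_kappa(expert: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(expert)
    p_o = sum(e == j for e, j in zip(expert, judge)) / n  # observed agreement
    e_counts, j_counts = Counter(expert), Counter(judge)
    # Expected agreement if both raters labeled independently at their marginals.
    p_e = sum(e_counts[lab] * j_counts[lab] for lab in LABELS) / (n * n)
    return (p_o - p_e) / (1 - p_e)

if __name__ == "__main__":
    expert = ["No Impact", "Minimal Impact", "Significant Impact", "No Impact"]
    judge  = ["No Impact", "Minimal Impact", "Minimal Impact",     "No Impact"]
    print(f"accuracy = {accuracy(expert, judge):.2f}")  # 0.75
    print(f"kappa    = {cohens_kappa(expert, judge):.3f}")  # 0.600
```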

📝 Abstract
As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's κ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
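
The abstract's core claim is that textual fidelity and clinical meaning can diverge. The sketch below, assuming standard word-level WER computed via edit distance, shows two ASR hypotheses with identical WER where one is clinically harmless and the other inverts the meaning; the example sentences are invented for illustration, not taken from the paper's datasets.

```python
# Word-level WER via dynamic-programming Levenshtein distance, illustrating
# how identical WER scores can carry very different clinical risk.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

ref = "the patient has no chest pain"
hyp_minor = "the patient has no chest pains"  # plausibly "No Impact"
hyp_major = "the patient has new chest pain"  # plausibly "Significant Impact"
print(wer(ref, hyp_minor))  # 0.167: one harmless substitution
print(wer(ref, hyp_major))  # 0.167: same WER, but the meaning is inverted
```
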
Problem

Research questions and friction points this paper is trying to address.

Assessing how ASR errors distort clinical understanding in patient dialogues
Evaluating whether standard metrics correlate with clinical impact of transcription errors
Developing an automated framework for safety assessment beyond textual fidelity
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-as-a-Judge replaces WER for clinical assessment (see the sketch after this list)
Programmatic prompt optimization using GEPA replicates expert evaluation
Automated framework assesses clinical safety impact
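
As referenced above, here is a minimal sketch of the LLM-as-a-Judge pattern. The prompt wording, label handling, and `call_model` stub are hypothetical placeholders, not the paper's GEPA-optimized prompt or a real Gemini-2.5-Pro client.

```python
# Sketch of an LLM-as-a-Judge for clinical-impact classification of ASR errors.
# Everything below is illustrative scaffolding, not the paper's actual prompt.

LABELS = ("No Impact", "Minimal Impact", "Significant Impact")

PROMPT_TEMPLATE = """You are a clinician reviewing an ASR transcript.
Ground-truth utterance: {reference}
ASR transcript:        {hypothesis}
Classify the clinical impact of any discrepancy as exactly one of:
No Impact, Minimal Impact, Significant Impact. Answer with the label only."""

def call_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. a Gemini client); returns raw text."""
    raise NotImplementedError("wire up your LLM client here")

def judge_clinical_impact(reference: str, hypothesis: str) -> str:
    """Ask the judge model for a risk label and normalize its answer."""
    raw = call_model(PROMPT_TEMPLATE.format(reference=reference,
                                            hypothesis=hypothesis))
    for label in LABELS:
        if label.lower() in raw.lower():
            return label
    return "Unparseable"  # route to human review rather than guessing
```
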
🔎 Similar Papers
No similar papers found.
Zachary Ellis
Ufonia Limited
Jared Joselowitz
Ufonia Limited
Yash Deo
University of York
Yajie He
Ufonia Limited
Anna Kalygina
Ufonia Limited
Aisling Higham
Ufonia Limited, Oxford University Hospitals
Mana Rahimzadeh
Moorfields Eye Hospital
Yan Jia
University of York
Ibrahim Habli
Professor of Safety-Critical Systems at the University of York
Safety, AI Safety, Autonomous Systems, Software Engineering
Ernest Lim
Ufonia Limited, University of York