Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

📅 2026-04-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the poor performance of general-purpose automatic speech recognition (ASR) systems in gastrointestinal endoscopy, where dense medical terminology and difficult acoustic conditions degrade accuracy. The authors propose EndoASR, a lightweight domain-adapted ASR system built on the Paraformer architecture (220M parameters). It uses a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness, and is designed for real-time edge deployment. In retrospective testing across six endoscopists, EndoASR reduces character error rate (CER) from 20.52% to 14.14% and raises medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective study across five independent endoscopy centers, it achieves a CER of 14.97% and a Med ACC of 84.16%, compared with 16.20% and 61.63% for the baseline Paraformer model. Its real-time factor (RTF) of 0.005 is lower than that of Whisper-large-v3 (0.055), and the improved transcription quality carries over to downstream structured information extraction with large language models.

📝 Abstract

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
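The two headline metrics in the abstract, CER and RTF, have standard definitions that can be sketched in a few lines. The code below is a minimal illustration, assuming CER is the character-level Levenshtein distance divided by the reference length and RTF is processing time divided by audio duration. The abstract does not describe the paper's exact scoring pipeline (text normalization, punctuation handling), and the function names and the toy strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent decoding per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Toy strings for illustration only; not data from the paper.
    print(cer("polyp", "polyb"))  # one substitution over five characters
    print(rtf(0.05, 10.0))        # 0.05 s of compute for 10 s of audio
```

Under these definitions, an RTF of 0.005 means roughly 0.05 s of compute for a 10 s utterance, and the reported gap to Whisper-large-v3 (0.055 versus 0.005) corresponds to about an 11-fold difference in processing time for the same audio.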

Problem

Research questions and friction points this paper is trying to address.

automatic speech recognition

domain adaptation

gastrointestinal endoscopy

clinical usability

real-world deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

domain-adapted ASR

endoscopy

multi-center evaluation