🤖 AI Summary
This study addresses the challenge of improving accuracy and robustness in Polish automatic speech recognition (ASR) under low-quality audio conditions typical of medical interviews. We systematically evaluate several end-to-end ASR models (Whisper, QuartzNet, FastConformer, Wav2Vec 2.0 XLSR, and models from the ESPnet Model Zoo) as well as the commercial system ElevenLabs Scribe. For the first time in this context, we incorporate a large language model (LLM) for post-processing, forming a two-stage recognition pipeline. Performance is assessed using word error rate (WER) and character error rate (CER) on both general and medical Polish datasets. Results demonstrate that Whisper achieves the best performance among open-source models, while ElevenLabs Scribe delivers superior accuracy overall and maintains high robustness even under audio degradation, highlighting its strong potential for real-world clinical applications.
📝 Abstract
This article presents comparative studies of Automatic Speech Recognition (ASR) models combined with a Large Language Model (LLM) for medical interviews. The proposed solution is tested on Polish-language benchmarks and a dataset of medical interviews. The latest ASR technologies are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, and most of them work as end-to-end solutions. The presented approach, in the case of the Whisper model, is a two-stage solution in which an end-to-end ASR model and an LLM work together in a pipeline: the ASR output serves as input to the LLM, which corrects and improves it. Comparative studies on automatic recognition of the Polish language were performed between modern end-to-end deep learning architectures and the hybrid ASR model. The medical interview tests were performed with two state-of-the-art ASR systems: OpenAI Whisper incorporated with an LLM, and ElevenLabs Scribe. Additionally, the results were compared with five more end-to-end models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR and the ESPnet Model Zoo) on the Mozilla Common Voice and VoxPopuli databases. Tests were conducted on clean audio, bandwidth-limited audio, and degraded audio. The tested models were evaluated on the basis of Word Error Rate (WER) and Character Error Rate (CER). The results show that the Whisper model performs by far the best among the open-source models, while the ElevenLabs Scribe model performs best for Polish on both the general benchmarks and the medical data.
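The WER and CER metrics mentioned above are both normalized Levenshtein edit distances, computed over words and characters respectively. A minimal sketch of how they are typically calculated is shown below; the function names are illustrative and not taken from the paper, which does not specify its evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deletions
    for j in range(n + 1):
        d[0][j] = j  # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[m][n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, if the reference transcript is "ala ma kota" and the ASR hypothesis is "ala ma psa", one of three words is wrong, giving a WER of 1/3. In practice, transcripts are usually normalized (lowercasing, punctuation removal) before scoring.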