Quality of Automatic Speech Recognition -- Polish Language case study -- from Wav2Vec to Scribe ElevenLabs

📅 2026-02-25
🤖 AI Summary
This study addresses the challenge of improving accuracy and robustness in Polish automatic speech recognition (ASR) under low-quality audio conditions typical of medical interviews. We systematically evaluate several end-to-end ASR models (Whisper, QuartzNet, FastConformer, Wav2Vec 2.0 XLSR, and models from the ESPnet Model Zoo) as well as the commercial system ElevenLabs Scribe. For the first time in this context, we incorporate a large language model (LLM) for post-processing, forming a two-stage recognition pipeline. Performance is assessed using word error rate (WER) and character error rate (CER) on both general and medical Polish datasets. Results show that Whisper achieves the best performance among open-source models, while ElevenLabs Scribe delivers the best overall accuracy and remains robust under audio degradation, which suggests strong potential for real-world clinical applications.
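
The two-stage pipeline mentioned above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the prompt wording, and the two stub components are all hypothetical, and a real setup would plug in an actual ASR model and an actual LLM client in their place.

```python
from typing import Callable

# Both stages are modelled as plain callables so that any ASR model or
# LLM client can be plugged in. These type aliases are illustrative only.
Transcriber = Callable[[str], str]   # audio path -> raw transcript
Corrector = Callable[[str], str]     # prompt -> corrected transcript


def build_correction_prompt(raw_transcript: str) -> str:
    """Wrap the raw ASR output in an instruction for the LLM.

    The wording is a hypothetical example, not the prompt used in the paper.
    """
    return (
        "Correct recognition errors in the following Polish medical "
        "interview transcript. Return only the corrected text.\n\n"
        + raw_transcript
    )


def two_stage_transcribe(
    audio_path: str, transcribe: Transcriber, correct: Corrector
) -> str:
    """Stage 1: ASR produces a raw transcript. Stage 2: the LLM corrects it."""
    raw_transcript = transcribe(audio_path)
    return correct(build_correction_prompt(raw_transcript))


# Stub components standing in for a real ASR model and a real LLM.
def stub_asr(audio_path: str) -> str:
    return "pacjent zglasza bol glowy"


def stub_llm(prompt: str) -> str:
    return "Pacjent zgłasza ból głowy."


if __name__ == "__main__":
    print(two_stage_transcribe("interview.wav", stub_asr, stub_llm))
```

Keeping both stages behind simple callables makes it straightforward to score the raw and the corrected transcripts separately, which is one way to measure how much the LLM stage contributes.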

📝 Abstract
This article presents a comparative study of Automatic Speech Recognition (ASR) models combined with a Large Language Model (LLM) for transcribing medical interviews. The proposed solution is tested on Polish-language benchmarks and on a dataset of medical interviews. The latest ASR technologies are based on convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Transformers, and most of them work as end-to-end solutions. For the Whisper model, the presented approach is a two-stage pipeline in which an end-to-end ASR model and an LLM work together: the ASR output is the input to the LLM, which corrects and improves the transcript. Modern end-to-end deep learning architectures were compared with a hybrid ASR model on automatic recognition of Polish. The medical interview tests were performed with two state-of-the-art ASR models: OpenAI Whisper combined with an LLM, and ElevenLabs Scribe. Additionally, the results were compared with five more end-to-end models (QuartzNet, FastConformer, Wav2Vec 2.0 XLSR and ESPnet Model Zoo) on the Mozilla Common Voice and VoxPopuli datasets. Tests were conducted on clean audio, bandwidth-limited audio, and degraded audio. The models were evaluated using Word Error Rate (WER) and Character Error Rate (CER). The results show that Whisper performs by far the best among the open-source models, while ElevenLabs Scribe performs best for Polish overall, on both the general benchmarks and the medical data.
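
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute WER and CER from the Levenshtein edit distance. It is an illustration under simple assumptions (whitespace tokenization, no text normalization, spaces counted as characters for CER); the paper's exact scoring setup is not given here, and the example sentences are invented.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, deletions and insertions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented example: the hypothesis drops two Polish diacritics.
    reference = "pacjent ma gorączkę i kaszel"
    hypothesis = "pacjent ma goraczke i kaszel"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

In this example one of five words is wrong, so WER is 0.2, while only two of 28 characters differ, so CER is much lower. Errors in diacritics, which are frequent in Polish, affect the two metrics very differently, which is one reason to report both.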
Problem

Research questions and friction points this paper is trying to address.

Automatic Speech Recognition
Polish language
medical interviews
Word Error Rate
audio degradation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hybrid ASR-LLM pipeline
Polish medical speech recognition
two-stage speech recognition
Whisper with LLM post-processing
comparative ASR evaluation
Authors
Marcin Pietroń, AGH
Szymon Piórkowski, Coraz Zdrowiej, Krakow
Kamil Faber, Department of Computer Science, AGH, Krakow
Dominik Żurek, Department of Computer Science, AGH, Krakow
Michał Karwatowski, Institute of Electronics, AGH, Krakow
Jerzy Duda, Department of Management, AGH, Krakow
Hubert Zieliński, Coraz Zdrowiej, Krakow
Piotr Lipnicki, Coraz Zdrowiej, Krakow
Mikołaj Leszczuk, AGH University of Science and Technology (Telecommunications)