MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

📅 2024-09-21
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address multilingual doctor–patient communication barriers in clinical settings, this paper introduces the first clinical-domain-oriented multilingual medical automatic speech recognition (ASR) system. Methodologically, the authors construct the largest publicly available multilingual medical ASR dataset, covering Mandarin, English, German, French, and Vietnamese, with diverse accents, recording conditions, and speaker roles (e.g., physicians, patients, nurses). The model is built on an attention-based encoder-decoder architecture and integrates multilingual joint training, speech-text alignment, and domain adaptation techniques. The paper establishes the first benchmark and systematic evaluation framework for multilingual medical ASR, encompassing monolingual versus multilingual performance comparisons, an Attention Encoder-Decoder (AED) versus hybrid-architecture comparative study, a layer-wise ablation study of the AED, and linguistically grounded interpretability analyses. All datasets, code, and models are fully open-sourced. Experimental results demonstrate substantial improvements in cross-lingual clinical speech understanding and intelligent clinical consultation capabilities.
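To make the AED idea in the summary concrete, the sketch below shows one decoder step of a toy attention encoder-decoder for ASR in plain numpy: an encoder maps acoustic frames to hidden states, a decoder query attends over them, and the attended context is projected to a next-token distribution. All dimensions, weights, and names here are hypothetical illustrations, not the paper's actual model.

```python
import numpy as np

# Toy attention encoder-decoder (AED) step for ASR.
# All sizes and weights are hypothetical; this only illustrates the mechanism.

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, d, V = 50, 16, 100                 # acoustic frames, hidden size, vocab size
frames = rng.normal(size=(T, d))      # stand-in for acoustic features

W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
enc = np.tanh(frames @ W_enc)         # encoder hidden states, one per frame

# One decoder step: the current decoder state queries the encoder states.
dec_state = rng.normal(size=(d,))
scores = enc @ dec_state / np.sqrt(d)  # scaled dot-product attention scores
alpha = softmax(scores)                # attention weights over the T frames
context = alpha @ enc                  # weighted sum of encoder states

W_out = rng.normal(size=(d, V)) / np.sqrt(d)
token_probs = softmax(context @ W_out)  # distribution over the next token
```

In a real AED ASR system this step is repeated autoregressively, with the decoder state updated from the previously emitted token; the hybrid systems the paper compares against instead combine an acoustic model with a separate language model.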

📝 Abstract
Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To the best of our knowledge, MultiMed stands as the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, an Attention Encoder Decoder (AED) vs. Hybrid comparative study, a layer-wise ablation study for the AED, and a linguistic analysis for multilingual medical ASR. All code, data, and models are available online: https://github.com/leduckhai/MultiMed/tree/master/MultiMed
Problem

Research questions and friction points this paper is trying to address.

Multilingual Speech Recognition
Medical Communication
Pandemic
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multilingual Medical Speech Recognition
Attention Encoder-Decoder Method
Language Synergy in Healthcare