VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain

📅 2024-04-08

🏛️ International Conference on Language Resources and Evaluation

📈 Citations: 8

✨ Influential: 0

🤖 AI Summary

To address the lack of publicly available datasets and models for Vietnamese medical speech recognition, this paper introduces VietMed, a large-scale, domain-specific dataset that covers all ICD-10 disease groups and all regional accents of Vietnam. It comprises 16 hours of labeled medical speech plus 2,200 hours of unlabeled speech (1,000 hours medical and 1,200 hours general-domain). The authors also release the first public large-scale pre-trained models for Vietnamese ASR (w2v2-Viet and XLSR-53-Viet) and their medically fine-tuned variants. The approach employs self-supervised pre-training followed by supervised fine-tuning, enhanced with domain-adaptive data augmentation and lexicon-guided decoding. Although its pre-training used no medical data, XLSR-53-Viet generalizes well to the medical domain: on the medical test set it reduces word error rate (WER) from 51.8% for the state-of-the-art XLSR-53 to 29.6%, a relative reduction of about 42.9%. All data, code, and models are publicly released.

📝 Abstract

Due to privacy restrictions, there is a shortage of publicly available speech recognition datasets in the medical domain. In this work, we present VietMed, a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country. Moreover, we release the first public large-scale pre-trained models for Vietnamese ASR, w2v2-Viet and XLSR-53-Viet, along with the first public large-scale fine-tuned models for medical ASR. Even without any medical data in unsupervised pre-training, our best pre-trained model XLSR-53-Viet generalizes very well to the medical domain by outperforming state-of-the-art XLSR-53, from 51.8% to 29.6% WER on the test set (a relative reduction of more than 40%). All code, data and models are made publicly available.
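
The relative reduction quoted in the abstract follows directly from the two WER figures. The following is a minimal Python sketch, assuming WER is computed in the usual way as word-level edit distance divided by the number of reference words. The function names are illustrative and are not taken from the VietMed code release.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: plain whitespace tokenization; the paper's own text
    normalization is not reproduced here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Figures from the abstract: XLSR-53 at 51.8% WER, XLSR-53-Viet at 29.6%.
    reduction = relative_wer_reduction(51.8, 29.6)
    print(f"{reduction:.1%}")  # about 42.9%, i.e. "more than 40%"
```

The absolute drop is 22.2 WER points; dividing by the 51.8% baseline gives the relative figure.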

Problem

Research questions and friction points this paper is trying to address.

Lack of public medical speech datasets for Vietnamese ASR.

Need for diverse medical ASR data covering diseases and accents.

Absence of public pre-trained models for Vietnamese ASR and of public fine-tuned models for medical ASR.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest public medical speech dataset and largest public Vietnamese speech dataset by total duration

First public pre-trained Vietnamese ASR models

Covers all ICD-10 disease groups

🔎 Similar Papers

MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder