🤖 AI Summary
Multilingual speech datasets, particularly for low-resource languages, suffer from pervasive quality deficiencies at both the macro level (e.g., ambiguous dialect boundaries, absence of language planning) and the micro level (e.g., grapheme–phoneme inconsistency), severely impeding ASR model training and evaluation. This paper takes Taiwanese Hokkien (nan_tw) as a case study and proposes, for the first time, a dual-track framework integrating sociolinguistic awareness and prospective language planning to embed linguistic governance directly into ASR data curation. Through cross-dataset auditing (Common Voice, FLEURS, VoxPopuli), fieldwork, dialect annotation consistency assessment, and orthographic adaptability testing, we identify significant macro-level risks in 21 of the 37 languages examined. The work yields an actionable, linguistically grounded guideline for multilingual speech dataset construction, formally adopted by Hugging Face as the v2.0 community standard.
📝 Abstract
Our quality audit of three widely used public multilingual speech datasets (Mozilla Common Voice 17.0, FLEURS, and VoxPopuli) shows that, for some languages, these datasets suffer from significant quality issues. We believe addressing these issues will make the datasets more useful as training and evaluation sets and improve downstream models. We divide these quality issues into two categories: micro-level and macro-level. We find that macro-level issues are more prevalent in less institutionalized, often under-resourced languages. We provide a case analysis of Taiwanese Southern Min (nan_tw) that highlights the need for proactive language planning (e.g., orthography prescriptions, dialect boundary definition) and enhanced data quality control in the process of Automatic Speech Recognition (ASR) dataset creation. We conclude by proposing guidelines and recommendations to mitigate these issues in future dataset development, emphasizing the importance of sociolinguistic awareness in creating robust and reliable speech data resources.