Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic speech recognition (ASR) for low-resource languages such as Armenian and Georgian suffers from severe data scarcity. Method: This work systematically investigates data augmentation strategies—including crowdsourced speech collection, pseudo-labeling, and fusion-cleaning of heterogeneous audio sources (audiobooks, Common Voice, YouTube)—and introduces an end-to-end data construction pipeline integrating FastConformer training and ablation studies. It is the first to comparatively evaluate cost–quality trade-offs and language-specific dependencies of these methods in low-resource settings. Contribution/Results: Empirical analysis reveals that paid crowdsourcing achieves the optimal balance between accuracy and cost, while linguistic characteristics significantly influence method efficacy. The proposed approach yields word error rates (WER) of 9.9% on Armenian and 5.73% on Georgian—state-of-the-art for these languages—and delivers reproducible, open-source bilingual ASR models alongside a fully documented, scalable data curation workflow.

📝 Abstract
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing, and permissive data sources such as audiobooks, Common Voice, and YouTube. While these methods are well-explored for high-resource languages, their application to low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers choosing cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from the various data extension approaches is that paid crowdsourcing offers the best balance between cost and quality, outperforming volunteer crowdsourcing, open-source audiobooks, and unlabeled data usage. An ablation study shows that models trained on the expanded datasets outperform existing baselines, achieving word error rates of 5.73% for Georgian and 9.9% for Armenian ASR using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
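The word error rate (WER) figures quoted above are the standard ASR evaluation metric: the word-level edit distance between a reference transcript and a hypothesis, divided by the reference length. A minimal sketch of the computation (not the authors' evaluation code, which likely uses a toolkit such as NeMo):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + sub,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a b x d")` gives 0.25 (one substitution over four reference words); a reported 9.9% WER means roughly one word error per ten reference words.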
Problem

Research questions and friction points this paper is trying to address.

Increase training data volume for low-resource languages
Explore methods such as crowdsourcing and pseudo-labeling
Apply these techniques to Armenian and Georgian
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crowdsourcing for data expansion
Advanced data preprocessing techniques
Open-sourced Armenian and Georgian models
Alexan Ayrapetyan
NVIDIA, Yerevan, Armenia
Sofia Kostandian
NVIDIA, Yerevan, Armenia
Ara Yeroyan
Plat.ai, Yerevan, Armenia
Mher Yerznkanyan
Buymie, Yerevan, Armenia
Nikolay Karpov
NVIDIA (speech recognition, computational linguistics, information retrieval)
Nune Tadevosyan
NVIDIA, Yerevan, Armenia
Vitaly Lavrukhin
NVIDIA
Boris Ginsburg
NVIDIA (Deep Learning, Speech Recognition, Speech Synthesis)