Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-resource automatic speech recognition (ASR) faces dual challenges: scarcity of labeled training data and prohibitive computational costs associated with large-scale models. To address these, we propose an efficient multilingual ASR framework comprising three key components: (1) construction of a cross-lingual unsupervised corpus to enable targeted continual pretraining using unlabeled data from linguistically related languages; (2) integration of morphology-aware subword tokenization to enhance subword modeling for low-resource languages; and (3) design of a scalable, reproducible data curation pipeline. Our 300M-parameter multilingual model achieves state-of-the-art Persian ASR performance—surpassing Whisper Large v3 despite using only 20% of its parameters and less supervised data—while matching its performance on Arabic and Urdu. Crucially, our work demonstrates that data relevance and training strategy outweigh sheer model scale, establishing a lightweight, reproducible, and cost-effective pathway for low-resource ASR.

📝 Abstract
Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and the computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretraining combined with morphology-aware tokenization to develop a 300M-parameter model that achieves performance comparable to systems 5 times larger. Our model outperforms Whisper Large v3 (1.5B parameters) on Persian and achieves competitive results on Arabic and Urdu despite using significantly fewer parameters and substantially less labeled data. These findings challenge the prevailing assumption that ASR quality scales primarily with model size, revealing instead that data relevance and strategic pretraining are more critical factors for low-resource scenarios. This work provides a practical pathway toward inclusive speech technology, enabling effective ASR for underrepresented languages without dependence on massive computational infrastructure or proprietary datasets.
Problem

Research questions and friction points this paper is trying to address.

Addresses ASR for low-resource languages with limited labeled data
Leverages cross-lingual unlabeled data to bridge resource gaps
Challenges reliance on large models for speech recognition quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-lingual pretraining with unlabeled multilingual data
Morphologically-aware tokenization for Perso-Arabic languages
Continual pretraining to achieve efficiency with 300M parameters
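The page does not detail the tokenization scheme. As a hedged sketch only, morphology-aware subword modeling is commonly approximated by pre-segmenting words at known affix boundaries before training a standard BPE or unigram tokenizer, so that frequent morphemes become stable subword units. Everything below (the suffix inventory, the Latin transliteration, and the `presegment` helper) is illustrative and not the authors' implementation:

```python
# Illustrative sketch: morpheme-boundary pre-segmentation ahead of subword
# tokenizer training. The suffix list is a hypothetical example in Latin
# transliteration, NOT the paper's actual inventory; real systems derive
# boundaries from a morphological analyzer or unsupervised segmentation.

SUFFIXES = ["ha", "tar", "tarin"]  # hypothetical Persian suffixes (plural, comparative, superlative)

def presegment(word: str, suffixes=SUFFIXES) -> list[str]:
    """Split off the longest matching suffix; mark the stem with '@@'
    so a downstream BPE/unigram trainer treats it as a continuation."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf):
            return [word[: -len(suf)] + "@@", suf]
    return [word]

print(presegment("ketabha"))      # "books" -> ['ketab@@', 'ha']
print(presegment("bozorgtarin"))  # longest suffix wins -> ['bozorg@@', 'tarin']
print(presegment("sib"))          # no suffix -> ['sib']
```

A tokenizer trained on such pre-segmented text tends to reuse the suffix tokens across stems, which is one plausible way to improve subword modeling for morphologically rich Perso-Arabic languages.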
Srihari Bandarupalli
Speech Processing Lab, International Institute of Information Technology Hyderabad, India
Bhavana Akkiraju
Speech Processing Lab, International Institute of Information Technology Hyderabad, India
Charan Devarakonda
Speech Processing Lab, International Institute of Information Technology Hyderabad, India
Vamsiraghusimha Narsinga
Speech Processing Lab, International Institute of Information Technology Hyderabad, India
Anil Kumar Vuppala
Associate Professor, LTRC, IIIT Hyderabad
Speech signal processing