🤖 AI Summary
To address the severe scarcity of training data for automatic speech recognition (ASR) in low-resource languages (e.g., Vatlongos, Nashta), this paper proposes three lightweight, text-only data augmentation methods: gloss-based replacement, random token replacement, and large language model (LLM)-based text generation, each followed by speech synthesis via text-to-speech (TTS). Crucially, the methods leverage only the original annotated data, requiring no external annotations or parallel corpora, which keeps them simple to deploy. When used to fine-tune Wav2Vec2-XLSR-53 on a mix of original and synthetic audio, the approach yields substantial ASR improvements, including a 14.3-point absolute reduction in word error rate (WER) for Nashta. Evaluation shows consistent gains across four extremely low-resource languages and also for a high-resource language, English. This work offers a simple, broadly applicable recipe for data expansion in low-resource ASR.
📝 Abstract
This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text (via gloss-based replacement, random replacement, or an LLM-based approach) and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
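As a rough illustration of the text-generation step, here is a minimal sketch of random token replacement, the simplest of the three methods. The vocabulary construction, replacement rate, and toy sentences below are assumptions for illustration; the paper's exact procedure may differ, and the subsequent TTS and fine-tuning stages are omitted.

```python
import random

def random_token_replacement(sentence, vocabulary, replace_prob=0.15, seed=0):
    """Replace each token with a random in-vocabulary token
    with probability replace_prob, keeping sentence length fixed."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(vocabulary) if rng.random() < replace_prob else tok
        for tok in sentence.split()
    )

# Hypothetical toy corpus standing in for transcribed low-resource data.
corpus = ["mi stap long haos", "yu kam long taon"]
# Build the replacement vocabulary from the original annotated text only.
vocabulary = sorted({tok for s in corpus for tok in s.split()})
augmented = [random_token_replacement(s, vocabulary, seed=i)
             for i, s in enumerate(corpus)]
```

Each augmented sentence would then be passed to a TTS system to synthesize matching audio, and the resulting text–audio pairs mixed with the original data for fine-tuning.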