🤖 AI Summary
To address the severe scarcity of training data for automatic speech recognition (ASR) in low-resource languages (e.g., Vatlongos, Nashta), this paper proposes three lightweight, text-only data augmentation methods: gloss-based replacement, random token replacement, and large language model (LLM)-based text generation, each followed by speech synthesis via text-to-speech (TTS). Crucially, the methods leverage only the original annotated data, requiring no external annotations or parallel corpora, which keeps them simple to deploy. When used to fine-tune Wav2Vec2-XLSR-53 on a mix of original and synthetic audio, the approach yields substantial ASR improvements, including a 14.3-point absolute reduction in word error rate (WER) for Nashta. Evaluation shows consistent gains across four extremely low-resource languages and also for a high-resource language, English. This work offers a simple, broadly applicable recipe for data expansion in low-resource ASR.
📝 Abstract
This paper introduces three self-contained data augmentation methods for low-resource Automatic Speech Recognition (ASR). Our techniques first generate novel text (via gloss-based replacement, random replacement, or an LLM-based approach) and then apply Text-to-Speech (TTS) to produce synthetic audio. We apply these methods, which leverage only the original annotated data, to four languages with extremely limited resources (Vatlongos, Nashta, Shinekhen Buryat, and Kakabe). Fine-tuning a pretrained Wav2Vec2-XLSR-53 model on a combination of the original audio and generated synthetic data yields significant performance gains, including a 14.3% absolute WER reduction for Nashta. The methods prove effective across all four low-resource languages and also show utility for high-resource languages like English, demonstrating their broad applicability.
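As a rough illustration of the text-generation step, here is a minimal sketch of random token replacement, the simplest of the three methods. The vocabulary construction, replacement rate, and toy sentences below are assumptions for illustration; the paper's exact procedure may differ, and the subsequent TTS and fine-tuning stages are omitted.

```python
import random

def random_token_replacement(sentence, vocabulary, replace_prob=0.15, seed=0):
    """Replace each token with a random in-vocabulary token
    with probability replace_prob, keeping sentence length fixed."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(vocabulary) if rng.random() < replace_prob else tok
        for tok in sentence.split()
    )

# Hypothetical toy corpus standing in for transcribed low-resource data.
corpus = ["mi stap long haos", "yu kam long taon"]
# Build the replacement vocabulary from the original annotated text only.
vocabulary = sorted({tok for s in corpus for tok in s.split()})
augmented = [random_token_replacement(s, vocabulary, seed=i)
             for i, s in enumerate(corpus)]
```

Each augmented sentence would then be passed to a TTS system to synthesize matching audio, and the resulting text–audio pairs mixed with the original data for fine-tuning.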