From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

📅 2025-05-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multilingual automatic speech recognition (ASR) suffers from a severe scarcity of authentic speech data, often limited to just tens of hours per language. To address this, we propose an ASR-optimized speech back-translation paradigm: a small amount of real speech is used to train a multilingual text-to-speech (TTS) model, which then synthesizes high-intelligibility speech from large-scale text corpora, scaling synthetic data generation from hours to over 100,000 hours. We introduce the first intelligibility-based framework for evaluating synthetic speech quality, along with empirically derived validity thresholds, to rigorously filter high-quality samples. Using this pipeline, we generate over 500,000 hours of synthetic speech across ten languages. Continual pretraining of Whisper-large-v3 on this data yields an average word error rate reduction exceeding 30%, demonstrating that highly effective multilingual ASR data engines can be built from minimal authentic speech resources.
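
As a rough illustration of the generation step described above, the sketch below converts lines of a text corpus into synthetic speech with an off-the-shelf multilingual TTS model. The Coqui TTS toolkit and the `xtts_v2` checkpoint are stand-ins chosen for illustration; the paper trains its own TTS models and does not specify this toolkit.

```python
# Sketch: turn a text corpus into synthetic (speech, text) training pairs
# with an off-the-shelf multilingual TTS. The toolkit and checkpoint are
# illustrative assumptions, not the models used in the paper.
from pathlib import Path
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

def back_translate(corpus_path: str, out_dir: str, language: str,
                   speaker_wav: str) -> None:
    """Synthesize one utterance per text line; each (wav, text) pair becomes ASR training data."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(corpus_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            text = line.strip()
            if not text:
                continue
            tts.tts_to_file(
                text=text,
                language=language,
                speaker_wav=speaker_wav,  # short reference clip for voice conditioning
                file_path=str(out / f"utt_{i:08d}.wav"),
            )

# Example: back_translate("corpus.fr.txt", "synthetic/fr", "fr", "ref_speaker.wav")
```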

📝 Abstract
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
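
A minimal sketch of the continual pre-training step is given below, assuming synthetic (audio, text) pairs produced by the pipeline above and the Hugging Face `transformers` implementation of Whisper; the paper's actual training recipe (batching, optimizer schedule, data mixing) is not reproduced here.

```python
# Sketch: one continual pre-training step of Whisper-large-v3 on a synthetic pair.
# Assumes Hugging Face transformers and a 16 kHz waveform as a float array;
# hyperparameters are illustrative placeholders, not the paper's settings.
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, transcript: str) -> float:
    """Compute the seq2seq loss on one synthetic (speech, text) pair and update the weights."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    outputs = model(input_features=inputs.input_features, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```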
Problem

Research questions and friction points this paper is trying to address.

Extending multilingual ASR coverage with limited speech resources
Scaling synthetic speech generation using small real speech datasets
Improving ASR accuracy via large-scale synthetic speech data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling back-translation using synthetic speech generation
Training TTS models with minimal real speech data
Intelligibility-based framework for synthetic speech assessment (sketched below)
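
One way to realize such an intelligibility check, sketched below, is to transcribe each synthetic utterance with a reference ASR model and keep only samples whose word error rate (WER) against the source text falls under a threshold. The `openai-whisper` and `jiwer` tooling and the 0.3 threshold are illustrative assumptions, not the paper's exact protocol or empirically derived values.

```python
# Sketch: intelligibility-based filtering of synthetic speech.
# Assumes the openai-whisper package and jiwer for WER; the threshold is an
# illustrative placeholder, not the value derived in the paper.
import jiwer
import whisper

asr = whisper.load_model("large-v3")

def is_intelligible(wav_path: str, source_text: str, language: str,
                    max_wer: float = 0.3) -> bool:
    """Keep a synthetic sample only if a reference ASR model can recover its source text."""
    hypothesis = asr.transcribe(wav_path, language=language)["text"]
    return jiwer.wer(source_text.lower(), hypothesis.lower()) <= max_wer

# Example: keep only intelligible (wav, text) pairs before ASR training.
# kept = [(w, t) for w, t in pairs if is_intelligible(w, t, "fr")]
```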
Tianduo Wang
StatNLP Research Group, Singapore University of Technology and Design
Lu Xu
Postdoc, Riken AIP
deep learning, machine learning, computer vision
Wei Lu
StatNLP Research Group, Singapore University of Technology and Design
Shanbo Cheng
ByteDance Seed
LLMs, ML, NLP, Machine Translation, Multimodal