A Self-Refining Framework for Enhancing ASR Using TTS-Synthesized Data

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses improving speech recognition using only unlabeled speech data. The authors propose a closed-loop ASR-TTS self-refining framework: an existing ASR model generates pseudo-labels on unannotated speech, which are used to train a high-fidelity end-to-end text-to-speech system; the resulting synthetic speech-text pairs are then bootstrapped back to refine the ASR model. The approach requires no manual annotations or teacher-model distillation. Using 6,000 hours of unlabeled Taiwanese Mandarin speech, a moderate amount of text, and model-synthesized content, the authors adapt Whisper-large-v2 into a specialized model, Twister, which reduces error rates by up to 20% on Mandarin and up to 50% on Mandarin-English code-switching benchmarks, demonstrating a practical path for low-resource and domain-specific ASR.

📝 Abstract
We propose a self-refining framework that enhances ASR performance with only unlabeled datasets. The process starts with an existing ASR model generating pseudo-labels on unannotated speech, which are then used to train a high-fidelity text-to-speech (TTS) system. The synthesized speech-text pairs are then bootstrapped into the original ASR system, completing the closed-loop self-improvement cycle. We demonstrate the effectiveness of the framework on Taiwanese Mandarin speech. Leveraging 6,000 hours of unlabeled speech, a moderate amount of text data, and synthetic content from the AI models, we adapt Whisper-large-v2 into a specialized model, Twister. Twister reduces error rates by up to 20% on Mandarin and 50% on Mandarin-English code-switching benchmarks compared to Whisper. The results highlight the framework as a compelling alternative to pseudo-labeling and self-distillation approaches, and provide a practical pathway for improving ASR performance in low-resource or domain-specific settings.
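The abstract's closed-loop cycle (ASR pseudo-labels → TTS training → synthetic pairs → ASR fine-tuning) can be sketched in a few lines. The sketch below is a conceptual illustration only: `train_tts`, `finetune_asr`, and the dictionary-based toy "models" are hypothetical stand-ins invented for this example, not the paper's actual training code or data representation.

```python
def train_tts(pairs):
    # Toy stand-in for TTS training: memorize a text -> speech mapping
    # from the pseudo-labeled (speech, text) pairs.
    mapping = {text: speech for speech, text in pairs}
    return {"synthesize": lambda text: mapping.get(text, "synthetic:" + text)}

def finetune_asr(asr_model, pairs):
    # Toy stand-in for ASR fine-tuning: memorize a speech -> text mapping
    # from the synthetic pairs, falling back to the previous model.
    mapping = {speech: text for speech, text in pairs}
    base = asr_model["transcribe"]
    return {"transcribe": lambda speech: mapping.get(speech, base(speech))}

def self_refine(asr_model, unlabeled_speech, text_corpus, rounds=1):
    """One or more ASR -> TTS -> ASR refinement cycles."""
    for _ in range(rounds):
        # Step 1: the current ASR model pseudo-labels the unlabeled speech.
        pseudo_pairs = [(utt, asr_model["transcribe"](utt))
                        for utt in unlabeled_speech]
        # Step 2: train a TTS system on the pseudo-labeled pairs.
        tts_model = train_tts(pseudo_pairs)
        # Step 3: synthesize speech for the unpaired text corpus.
        synthetic_pairs = [(tts_model["synthesize"](txt), txt)
                           for txt in text_corpus]
        # Step 4: fine-tune the ASR model on the synthetic speech-text pairs,
        # closing the loop.
        asr_model = finetune_asr(asr_model, synthetic_pairs)
    return asr_model

# Minimal usage: a seed ASR "model" and one refinement round.
seed_asr = {"transcribe": lambda speech: speech.replace("audio:", "")}
refined = self_refine(seed_asr, ["audio:hello"], ["hello", "world"])
```

Note that text covered only by the text corpus (here `"world"`) enters the ASR training data solely through TTS synthesis, which is how the framework exploits unpaired text without any manual labels.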
Problem

Research questions and friction points this paper is trying to address.

Enhances ASR performance using unlabeled datasets
Leverages TTS-synthesized data for self-improvement
Reduces error rates in low-resource language settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-refining framework enhances ASR with unlabeled data
TTS synthesizes data for ASR self-improvement cycle
Adapts Whisper into Twister reducing error rates significantly
Cheng Kang Chou
MediaTek Research, National Taiwan University
Chan-Jan Hsu
MediaTek Research, National Taiwan University
Ho-Lam Chung
National Taiwan University
Liang-Hsuan Tseng
National Taiwan University
Deep Learning, Speech Processing
Hsi-Chun Cheng
National Taiwan University
Yu-Kuan Fu
Graduate Institute of Communication Engineering, National Taiwan University
Deep Learning, Speech Signal Processing
Kuan Po Huang
National Taiwan University
Deep Learning, Speech Processing
Hung-Yi Lee
National Taiwan University