WST: Weakly Supervised Transducer for Automatic Speech Recognition

📅 2025-11-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
In end-to-end automatic speech recognition (ASR), RNN-Transducer (RNN-T) models rely on large-scale, high-quality labeled data and degrade severely when real-world transcripts contain high error rates (up to 70%). To address this, we propose the Weakly Supervised Transducer (WST), a training framework that introduces a differentiable alignment graph, requiring no confidence estimation or auxiliary pretraining, to explicitly model the fuzzy alignment between noisy text transcripts and speech sequences. This enables robust learning directly from highly erroneous transcriptions. Evaluated on both synthetic and industrial datasets, WST consistently outperforms mainstream CTC-based weakly supervised methods, including BTC and OTC, in accuracy, training stability, and generalization. Its efficiency and effectiveness make it a practical and scalable solution for low-resource ASR deployment.

📝 Abstract
The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
Problem

Research questions and friction points this paper is trying to address.

Heavy reliance on large-scale, high-quality annotated speech data
Transcription errors that otherwise require confidence estimation models
Severe performance degradation at transcription error rates of up to 70%
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses flexible training graph for transcript errors
Eliminates need for confidence estimation models
Maintains performance with high transcription error rates
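The "flexible training graph" idea shared by these weakly supervised methods can be illustrated with a toy sketch: alongside the ordinary arcs that emit a transcript token or a blank, the graph gains penalized bypass arcs that let an alignment path skip a transcript token that may be wrong. The snippet below is not the paper's WST graph or loss; it is a minimal forward-algorithm sketch (names and the `skip_penalty` parameter are illustrative assumptions) showing that adding bypass arcs can only raise the total alignment score of a noisy transcript.

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over a list of log-scores.
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_score(log_probs, transcript, skip_penalty=None):
    """Toy forward score over a linear alignment graph.

    State s = number of transcript tokens consumed so far (0..N).
    Per frame, a path may:
      * stay in its state, emitting blank (vocabulary index 0), or
      * advance one state, emitting the next transcript token, or
      * if skip_penalty is given, advance WITHOUT emitting that token,
        paying skip_penalty -- a crude stand-in for a bypass arc.
    """
    N = len(transcript)
    alpha = [-math.inf] * (N + 1)
    alpha[0] = 0.0
    for frame in log_probs:
        new = [-math.inf] * (N + 1)
        for s in range(N + 1):
            cands = [alpha[s] + frame[0]]  # blank, stay in state s
            if s > 0:
                # emit transcript token s-1, advancing s-1 -> s
                cands.append(alpha[s - 1] + frame[transcript[s - 1]])
            new[s] = logsumexp(cands)
        if skip_penalty is not None:
            # epsilon-like bypass arcs: consume a (possibly erroneous)
            # transcript token without consuming a frame; ascending order
            # lets several consecutive tokens be skipped in one frame
            for s in range(1, N + 1):
                new[s] = logsumexp([new[s], new[s - 1] + skip_penalty])
        alpha = new
    return alpha[N]
```

Because the bypass arcs only add alignment paths, the score of an erroneous transcript is strictly higher with them than without, which is the mechanism that keeps training gradients sane when the reference text is partly wrong.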
👥 Authors
Dongji Gao, Johns Hopkins University
Chenda Liao, Microsoft
Changliang Liu, Microsoft
Matthew Wiesner, Research Scientist, Johns Hopkins University
Leibny Paola García, Johns Hopkins University
Dan Povey, Xiaomi
S. Khudanpur, Johns Hopkins University
Jian Wu, Microsoft