🤖 AI Summary
In end-to-end automatic speech recognition (ASR), RNN-Transducer (RNN-T) models rely on large-scale, high-quality labeled data and degrade sharply when transcripts are noisy, as is common in real-world scenarios. To address this, the authors propose the Weakly Supervised Transducer (WST), a training framework that introduces a flexible alignment graph—requiring neither confidence estimation nor auxiliary pre-trained models—to explicitly model the fuzzy alignment between noisy text transcripts and speech sequences. This enables robust learning directly from highly erroneous transcriptions. Evaluated on both synthetic and industrial datasets, WST maintains performance even with transcription error rates of up to 70% and consistently outperforms mainstream CTC-based weakly supervised methods—including BTC and OTC—in accuracy, training stability, and generalization. Its efficiency and effectiveness make it a practical and scalable solution for low-resource ASR deployment.
📝 Abstract
The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
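To give intuition for how an error-tolerant training graph can absorb transcript mistakes, here is a minimal, self-contained sketch of a best-path alignment where each transcript token can either be consumed normally or skipped via a penalized bypass arc. This is an illustrative toy in the spirit of BTC/OTC-style flexible graphs, not the actual WST algorithm or its differentiable loss; the token names, the `<blk>` filler symbol, and the `bypass_penalty` value are all hypothetical choices for the example.

```python
def flexible_alignment_score(log_probs, transcript, bypass_penalty=-5.0):
    """Best-path score of `transcript` against per-frame token log-probs,
    where any transcript token may also be skipped through a penalized
    bypass arc. Toy illustration of an error-tolerant training graph,
    not the WST method itself.

    log_probs:  list of dicts, one per frame, token -> log-probability.
    transcript: list of tokens (possibly containing errors).
    """
    T, U = len(log_probs), len(transcript)
    NEG = float("-inf")
    # dp[u] = best score after consuming u transcript tokens
    dp = [0.0] + [NEG] * U
    # bypass arcs: consume a token without any frames (transcript insertion error)
    for u in range(U):
        dp[u + 1] = max(dp[u + 1], dp[u] + bypass_penalty)
    for t in range(T):
        new = [NEG] * (U + 1)
        for u in range(U + 1):
            if dp[u] == NEG:
                continue
            # stay: frame emits a blank/filler symbol
            filler = log_probs[t].get("<blk>", -10.0)
            new[u] = max(new[u], dp[u] + filler)
            # advance: frame emits the next transcript token
            if u < U:
                emit = log_probs[t].get(transcript[u], -10.0)
                new[u + 1] = max(new[u + 1], dp[u] + emit)
        # bypass arcs remain available after each frame
        for u in range(U):
            if new[u] != NEG:
                new[u + 1] = max(new[u + 1], new[u] + bypass_penalty)
        dp = new
    return dp[U]
```

With two frames that clearly support "a" then "b", the erroneous transcript `["a", "x", "b"]` still receives a finite score (the spurious "x" is routed through a bypass arc) instead of forcing the model to explain a token the audio never contained; a rigid graph without bypass arcs would have to spend probability mass on "x". WST's contribution, per the abstract, is building this kind of tolerance directly into the transducer's differentiable training graph.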