🤖 AI Summary
In end-to-end automatic speech recognition (ASR), RNN-Transducer (RNN-T) models rely on large-scale, high-quality labeled data and degrade sharply when transcripts are noisy, as is common in real-world scenarios. To address this, the authors propose the Weakly Supervised Transducer (WST), a training framework that introduces a flexible alignment graph—requiring neither confidence estimation nor auxiliary pre-trained models—to explicitly model the fuzzy alignment between noisy text transcripts and speech sequences. This enables robust learning directly from highly erroneous transcriptions. Evaluated on both synthetic and industrial datasets, WST maintains performance even with transcription error rates of up to 70% and consistently outperforms mainstream CTC-based weakly supervised methods—including BTC and OTC—in accuracy, training stability, and generalization. Its efficiency and effectiveness make it a practical and scalable solution for low-resource ASR deployment.
📝 Abstract
The Recurrent Neural Network-Transducer (RNN-T) is widely adopted in end-to-end (E2E) automatic speech recognition (ASR) tasks but depends heavily on large-scale, high-quality annotated data, which are often costly and difficult to obtain. To mitigate this reliance, we propose a Weakly Supervised Transducer (WST), which integrates a flexible training graph designed to robustly handle errors in the transcripts without requiring additional confidence estimation or auxiliary pre-trained models. Empirical evaluations on synthetic and industrial datasets reveal that WST effectively maintains performance even with transcription error rates of up to 70%, consistently outperforming existing Connectionist Temporal Classification (CTC)-based weakly supervised approaches, such as Bypass Temporal Classification (BTC) and Omni-Temporal Classification (OTC). These results demonstrate the practical utility and robustness of WST in realistic ASR settings. The implementation will be publicly available.
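To give intuition for how an error-tolerant training graph can absorb transcript mistakes, here is a minimal, self-contained sketch of a best-path alignment where each transcript token can either be consumed normally or skipped via a penalized bypass arc. This is an illustrative toy in the spirit of BTC/OTC-style flexible graphs, not the actual WST algorithm or its differentiable loss; the token names, the `<blk>` filler symbol, and the `bypass_penalty` value are all hypothetical choices for the example.

```python
def flexible_alignment_score(log_probs, transcript, bypass_penalty=-5.0):
    """Best-path score of `transcript` against per-frame token log-probs,
    where any transcript token may also be skipped through a penalized
    bypass arc. Toy illustration of an error-tolerant training graph,
    not the WST method itself.

    log_probs:  list of dicts, one per frame, token -> log-probability.
    transcript: list of tokens (possibly containing errors).
    """
    T, U = len(log_probs), len(transcript)
    NEG = float("-inf")
    # dp[u] = best score after consuming u transcript tokens
    dp = [0.0] + [NEG] * U
    # bypass arcs: consume a token without any frames (transcript insertion error)
    for u in range(U):
        dp[u + 1] = max(dp[u + 1], dp[u] + bypass_penalty)
    for t in range(T):
        new = [NEG] * (U + 1)
        for u in range(U + 1):
            if dp[u] == NEG:
                continue
            # stay: frame emits a blank/filler symbol
            filler = log_probs[t].get("<blk>", -10.0)
            new[u] = max(new[u], dp[u] + filler)
            # advance: frame emits the next transcript token
            if u < U:
                emit = log_probs[t].get(transcript[u], -10.0)
                new[u + 1] = max(new[u + 1], dp[u] + emit)
        # bypass arcs remain available after each frame
        for u in range(U):
            if new[u] != NEG:
                new[u + 1] = max(new[u + 1], new[u] + bypass_penalty)
        dp = new
    return dp[U]
```

With two frames that clearly support "a" then "b", the erroneous transcript `["a", "x", "b"]` still receives a finite score (the spurious "x" is routed through a bypass arc) instead of forcing the model to explain a token the audio never contained; a rigid graph without bypass arcs would have to spend probability mass on "x". WST's contribution, per the abstract, is building this kind of tolerance directly into the transducer's differentiable training graph.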