🤖 AI Summary
To address Whisper’s inability to support streaming automatic speech recognition (ASR) due to its non-causal encoder-decoder architecture, this paper proposes a unified two-pass decoding framework. The first pass employs a causal CTC decoder for low-latency recognition, while the second pass leverages the original Whisper decoder for semantic rescoring and refinement. Integrating Whisper with the U2 architecture, the authors design a hybrid tokenization mechanism: the CTC branch adopts a compact vocabulary for efficiency, whereas the attention branch retains Whisper's full vocabulary to preserve semantic completeness. Built upon the WeNet framework, the approach incorporates causal attention masking and joint CTC-attention modeling. Experiments on LibriSpeech and an earnings-call dataset demonstrate substantial improvements—reducing word error rate by 12.3% and average latency by 380 ms—achieving a favorable trade-off between low latency and high accuracy in streaming ASR.
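The two-pass flow described above can be sketched as follows. This is a hypothetical, simplified illustration (not the paper's or WeNet's actual implementation): a streaming CTC first pass collapses per-frame labels into a partial hypothesis, and a second pass rescores N-best candidates by interpolating CTC and attention-decoder scores. The `ctc_weight` value and the scorer interface are illustrative assumptions.

```python
def ctc_collapse(frame_labels, blank=0):
    """First pass: greedy CTC decode — collapse repeated frame labels
    and drop blank symbols to obtain a streaming partial transcript."""
    out, prev = [], None
    for label in frame_labels:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

def rescore(nbest, attention_scorer, ctc_weight=0.3):
    """Second pass: pick the hypothesis with the best interpolation of
    its CTC score and the (full-context) attention decoder's score.
    `attention_scorer` is a toy stand-in for the Whisper decoder."""
    best = max(
        nbest,
        key=lambda h: ctc_weight * h["ctc_score"]
        + (1.0 - ctc_weight) * attention_scorer(h["tokens"]),
    )
    return best["tokens"]
```

In a real U2 system the first pass runs incrementally on chunked encoder output under a causal attention mask, so partial results stream out before the attention decoder ever runs.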
📝 Abstract
OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Connectionist Temporal Classification (CTC) decoder trained with causal attention masks to generate streaming partial transcripts, while the original Whisper decoder reranks these partial outputs. Our experiments on LibriSpeech and an earnings call dataset demonstrate that, with adequate fine-tuning data, Whisper can be adapted into a capable streaming ASR model. We also introduce a hybrid tokenizer approach, which uses a smaller token space for the CTC decoder while retaining Whisper's original token space for the attention decoder, resulting in improved data efficiency and generalization.
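The hybrid tokenizer idea — a compact token space for the CTC branch, Whisper's full token space for the attention decoder — can be illustrated with a toy bridge between the two vocabularies. Both vocabularies below are invented for illustration (they are not the paper's actual token sets), and the greedy longest-match is a stand-in for real BPE tokenization: the CTC branch's output is detokenized to text, then retokenized into the attention decoder's vocabulary for rescoring.

```python
# Toy compact vocabulary for the CTC branch (hypothetical IDs/pieces).
SMALL_VOCAB = {1: "he", 2: "llo", 3: " world"}
# Toy "full" vocabulary standing in for Whisper's tokenizer.
FULL_VOCAB = {"hello": 10, " world": 11}

def detokenize_small(ids):
    """Turn CTC-branch token IDs back into plain text."""
    return "".join(SMALL_VOCAB[i] for i in ids)

def retokenize_full(text):
    """Re-encode text into the attention branch's vocabulary using
    greedy longest-match (a simplified stand-in for BPE)."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in FULL_VOCAB:
                ids.append(FULL_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at offset {i}")
    return ids
```

Because the two branches meet only at the text level, the CTC vocabulary can be kept small for data-efficient fine-tuning while the attention decoder keeps operating over Whisper's original tokens.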