Adapting Whisper for Streaming Speech Recognition via Two-Pass Decoding

📅 2025-06-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address Whisper’s inability to support streaming automatic speech recognition (ASR) due to its non-causal encoder-decoder architecture, this paper proposes a unified two-pass decoding framework. The first pass employs a causal CTC decoder for low-latency recognition, while the second pass leverages the original Whisper decoder for semantic rescoring and refinement. Innovatively integrating Whisper with the U2 architecture, we design a hybrid tokenization mechanism: the CTC branch adopts a compact vocabulary for efficiency, whereas the attention branch retains the full vocabulary to preserve semantic completeness. Built upon the WeNet framework, our approach incorporates causal attention masking and joint CTC-attention modeling. Experiments on LibriSpeech and an earnings call dataset demonstrate substantial improvements, reducing word error rate by 12.3% and average latency by 380 ms, achieving an optimal trade-off between low latency and high accuracy in streaming ASR.
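The two-pass flow described above can be sketched in plain Python: pass 1 collapses frame-level CTC outputs into a streaming partial transcript, and pass 2 rescores candidate hypotheses with an attention-decoder score. All token ids, scores, and the `ctc_weight` value below are hypothetical illustrations, not values from the paper.

```python
BLANK = 0  # CTC blank token id (assumed convention)

def ctc_greedy_decode(frame_ids):
    """Standard CTC greedy decoding: collapse repeated ids, drop blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

def rescore(hypotheses, attn_scores, ctc_scores, ctc_weight=0.3):
    """Second pass: interpolate CTC and attention log-scores per hypothesis
    and return the highest-scoring one (U2-style rescoring)."""
    best, best_score = None, float("-inf")
    for hyp in hypotheses:
        score = ctc_weight * ctc_scores[hyp] + (1 - ctc_weight) * attn_scores[hyp]
        if score > best_score:
            best, best_score = hyp, score
    return best
```

For example, the frame sequence `[0, 5, 5, 0, 3, 3, 3, 0, 5]` collapses to the partial transcript `[5, 3, 5]`, which would then be rescored alongside other beam hypotheses once the utterance ends.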

📝 Abstract
OpenAI Whisper is a family of robust Automatic Speech Recognition (ASR) models trained on 680,000 hours of audio. However, its encoder-decoder architecture, trained with a sequence-to-sequence objective, lacks native support for streaming ASR. In this paper, we fine-tune Whisper for streaming ASR using the WeNet toolkit by adopting a Unified Two-pass (U2) structure. We introduce an additional Connectionist Temporal Classification (CTC) decoder trained with causal attention masks to generate streaming partial transcripts, while the original Whisper decoder reranks these partial outputs. Our experiments on LibriSpeech and an earnings call dataset demonstrate that, with adequate fine-tuning data, Whisper can be adapted into a capable streaming ASR model. We also introduce a hybrid tokenizer approach, which uses a smaller token space for the CTC decoder while retaining Whisper's original token space for the attention decoder, resulting in improved data efficiency and generalization.
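One way to realize the hybrid-tokenizer idea from the abstract is to restrict the CTC branch to the tokens actually observed in fine-tuning data, while the attention branch keeps Whisper's full vocabulary. This pure-Python sketch builds the two id mappings; the token ids are made up for illustration.

```python
def build_ctc_vocab(training_token_ids):
    """Compact CTC vocabulary: only tokens seen in the fine-tuning data,
    with id 0 reserved for the CTC blank. The attention decoder is
    unaffected and continues to use Whisper's full token space."""
    seen = sorted(set(training_token_ids))
    full_to_ctc = {tok: i + 1 for i, tok in enumerate(seen)}
    ctc_to_full = {i + 1: tok for i, tok in enumerate(seen)}
    return full_to_ctc, ctc_to_full
```

Hypotheses from the compact CTC branch can then be mapped back through `ctc_to_full` before the full-vocabulary attention decoder rescores them.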
Problem

Research questions and friction points this paper is trying to address.

Adapt Whisper for streaming speech recognition
Implement two-pass decoding for partial transcripts
Improve data efficiency with hybrid tokenizer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-pass decoding for streaming ASR adaptation
Hybrid tokenizer improves efficiency and generalization
CTC decoder with causal masks for partial transcripts
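The causal attention masks mentioned above are typically implemented in U2-style models as chunk-wise masks: each frame may attend to frames in its own chunk and in all earlier chunks, which bounds latency by the chunk size. A minimal sketch, with illustrative frame counts and chunk size:

```python
def chunk_causal_mask(num_frames, chunk_size):
    """Boolean attention mask: entry [q][k] is True when query frame q
    may attend to key frame k, i.e. k lies in q's chunk or an earlier one."""
    mask = []
    for q in range(num_frames):
        q_chunk_end = (q // chunk_size + 1) * chunk_size  # exclusive end of q's chunk
        mask.append([k < q_chunk_end for k in range(num_frames)])
    return mask
```

With 4 frames and a chunk size of 2, frames 0 and 1 see only chunk 0, while frames 2 and 3 see both chunks, so partial transcripts can be emitted after each chunk.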
👥 Authors
Haoran Zhou (National University of Singapore)
Xingchen Song (WeNet Open Source Community, China)
Brendan Fahy (Bloomberg, USA)
Qiaochu Song (WeNet Open Source Community, China)
Binbin Zhang (WeNet Open Source Community, China)
Zhendong Peng (Tsinghua University)
Anshul Wadhawan (University of Pennsylvania)
Denglin Jiang (Bloomberg, USA)
Apurv Verma (Bloomberg, USA)
Vinay Ramesh (Bloomberg, USA)
Srivas Prasad (Bloomberg, USA)
Michele M. Franceschini (Bloomberg, USA)