Whisfusion: Parallel ASR Decoding via a Diffusion Transformer

πŸ“… 2025-08-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the latency bottleneck of autoregressive decoding in real-time automatic speech recognition (ASR) and the limited contextual modeling of non-autoregressive (NAR) approaches, this paper proposes Whisfusion: the first fully parallel NAR ASR framework that integrates a pretrained Whisper encoder with a text-diffusion Transformer decoder. The authors introduce a lightweight cross-attention adapter to align encoder and decoder representations, and combine parameter-efficient fine-tuning (PEFT) with a batched, multi-step parallel sampling strategy. Fine-tuned on LibriSpeech, Whisfusion achieves a word error rate (WER) of 8.3%, outperforming Whisper-tiny (9.7%). Moreover, on long-form speech (>20 s) it decodes up to 2.6× faster than the autoregressive baseline while maintaining accuracy, establishing a new, efficient operating point for low-latency long-context ASR.

πŸ“ Abstract
Fast Automatic Speech Recognition (ASR) is critical for latency-sensitive applications such as real-time captioning and meeting transcription. However, truly parallel ASR decoding remains challenging due to the sequential nature of autoregressive (AR) decoders and the context limitations of non-autoregressive (NAR) methods. While modern ASR encoders can process up to 30 seconds of audio at once, AR decoders still generate tokens sequentially, creating a latency bottleneck. We propose Whisfusion, the first framework to fuse a pre-trained Whisper encoder with a text diffusion decoder. This NAR architecture resolves the AR latency bottleneck by processing the entire acoustic context in parallel at every decoding step. A lightweight cross-attention adapter trained via parameter-efficient fine-tuning (PEFT) bridges the two modalities. We also introduce a batch-parallel, multi-step decoding strategy that improves accuracy by increasing the number of candidates with minimal impact on speed. Fine-tuned solely on LibriSpeech (960h), Whisfusion achieves a lower WER than Whisper-tiny (8.3% vs. 9.7%), and offers comparable latency on short audio. For longer utterances (>20s), it is up to 2.6x faster than the AR baseline, establishing a new, efficient operating point for long-form ASR. The implementation and training scripts are available at https://github.com/taeyoun811/Whisfusion.
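The decoding side described above can be illustrated with a minimal, pure-Python sketch of confidence-based iterative unmasking, the general mechanism behind multi-step parallel text-diffusion decoding (in the style of MaskGIT-like maskers). This is an assumption-laden toy, not the paper's implementation: `toy_predictor`, `MASK`, and `VOCAB` are hypothetical stand-ins, and a real decoder would condition every prediction on the Whisper encoder's acoustic states through the cross-attention adapter.

```python
import random

MASK = "<mask>"
VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_predictor(tokens):
    """Hypothetical stand-in for the diffusion decoder: for every masked
    position, return a (token, confidence) guess in parallel. A real model
    would cross-attend to the Whisper encoder's acoustic states here."""
    rng = random.Random(0)
    return [(rng.choice(VOCAB), rng.random()) if t == MASK else (t, 1.0)
            for t in tokens]

def parallel_unmask_decode(length, steps=3):
    """Confidence-based iterative unmasking: every position is predicted
    in parallel at each step, and only the least confident predictions
    are re-masked for the next step, so latency scales with the (small)
    number of steps rather than the sequence length."""
    tokens = [MASK] * length
    for step in range(steps, 0, -1):
        preds = toy_predictor(tokens)
        # How many positions may remain masked after this step
        # (a simple linear schedule down to zero).
        keep_masked = (length * (step - 1)) // steps
        # Re-mask the lowest-confidence predictions among still-masked slots.
        order = sorted(range(length), key=lambda i: preds[i][1])
        remask = [i for i in order if tokens[i] == MASK][:keep_masked]
        tokens = [preds[i][0] for i in range(length)]
        for i in remask:
            tokens[i] = MASK
    return tokens

out = parallel_unmask_decode(6, steps=3)
```

With this schedule, every call to the predictor fills the whole sequence in parallel, so a 6-token output needs only 3 decoder passes regardless of length; an autoregressive decoder would need 6.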
Problem

Research questions and friction points this paper is trying to address.

Parallel ASR decoding for latency-sensitive applications
Overcoming sequential bottleneck in autoregressive ASR decoders
Improving accuracy and speed for long-form speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuses Whisper encoder with diffusion decoder
Uses lightweight cross-attention adapter
Implements batch-parallel multi-step decoding
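The batch-parallel decoding idea in the last bullet can be sketched as follows: decode several candidate hypotheses along the batch dimension (nearly free on an accelerator), score them, and keep the best. The helper names and the scoring rule below are hypothetical placeholders, not the paper's actual sampler.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def sample_candidate(seed, length=5):
    """Hypothetical stand-in for one full parallel diffusion decoding run.
    In a real system, all candidates would be decoded together as one
    batched forward pass, so wall-clock cost barely grows with the count."""
    rng = random.Random(seed)
    tokens = [rng.choice(VOCAB) for _ in range(length)]
    score = sum(rng.random() for _ in tokens)  # placeholder for a model score
    return tokens, score

def batch_parallel_decode(num_candidates=4):
    """Decode several candidate transcripts 'in parallel' and keep the
    highest-scoring one; accuracy improves with num_candidates while
    latency stays close to a single decode."""
    candidates = [sample_candidate(seed) for seed in range(num_candidates)]
    best_tokens, _best_score = max(candidates, key=lambda c: c[1])
    return best_tokens

best = batch_parallel_decode()
```

The design point worth noting: because candidates occupy the batch dimension rather than the time dimension, raising `num_candidates` trades a little memory for accuracy, with minimal impact on latency.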
Taeyoun Kwon
Seoul National University
Junhyuk Ahn
Dept. of Electrical & Computer Engineering, Seoul National University
Artificial Intelligence · Quantum Computing · Quantum Machine Learning
Taegeun Yun
Soongsil University
Heeju Jwa
Seoul National University
Yoonchae Choi
Seoul National University
Siwon Park
Soongsil University
Nam-Joon Kim
Seoul National University
Jangchan Kim
Seoul National University
Hyun Gon Ryu
NVIDIA Corporation
Hyuk-Jae Lee
Seoul National University, Department of Electrical and Computer Engineering
Artificial Intelligence · Memory Architecture · Autonomous Driving · Image Processing