Diffusion Language Models for Speech Recognition

📅 2026-04-15

🤖 AI Summary
This work aims to improve the language modeling and recognition accuracy of automatic speech recognition (ASR) systems at decoding time. It applies masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) to ASR rescoring, which it presents as a first, and proposes a joint decoding framework that integrates CTC frame-level acoustic distributions with USDM token-level language distributions, so that acoustic and linguistic information are combined at each decoding step. The approach supports parallel text generation and is reported to improve recognition accuracy significantly across multiple benchmarks. All code and experimental pipelines are publicly released.
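To make the rescoring idea concrete, here is a minimal sketch of N-best rescoring by log-linear interpolation. It is a generic illustration, not the paper's recipe: the function name `rescore_nbest`, the `lm_weight` and `length_bonus` parameters, and the example scores are all assumptions. The language-model score is taken as given; how a diffusion language model produces it (for example, through a likelihood bound) is not specified in this summary.

```python
from typing import List, Tuple

# Each hypothesis: (text, ASR log-probability, LM log-probability).
# The LM log-probability is assumed to be supplied by the diffusion LM.
Hypothesis = Tuple[str, float, float]


def rescore_nbest(
    hypotheses: List[Hypothesis],
    lm_weight: float = 0.5,      # hypothetical interpolation weight
    length_bonus: float = 0.0,   # hypothetical per-word bonus
) -> List[Tuple[str, float]]:
    """Return (text, combined score) pairs sorted from best to worst."""
    scored = []
    for text, asr_logp, lm_logp in hypotheses:
        num_words = len(text.split())
        combined = asr_logp + lm_weight * lm_logp + length_bonus * num_words
        scored.append((text, combined))
    # Higher (less negative) combined score is better.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration only.
nbest = [
    ("ice cream for ice cream", -3.5, -20.0),
    ("i scream for ice cream", -4.0, -12.0),
]
ranked = rescore_nbest(nbest, lm_weight=0.5)
best_text = ranked[0][0]
```

In this toy list the acoustically preferred hypothesis loses to the second one once the language-model score is added, which is the effect rescoring is meant to achieve.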

📝 Abstract
Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their bidirectional attention and parallel text generation. In this work, we explore variants of these models for use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.
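The abstract does not say how the framewise CTC distributions are aligned with the labelwise USDM distributions, so the following is only a rough sketch of the combination step under a strong assumption: the acoustic evidence has already been reduced to one distribution per label position. Given that, the two distributions can be fused log-linearly and renormalized. The names `fuse_position`, `fuse_sequence`, and `lm_weight` are hypothetical, and the numbers are invented.

```python
import math
from typing import List


def fuse_position(
    acoustic: List[float],
    lm: List[float],
    lm_weight: float = 1.0,  # hypothetical weight on the LM distribution
    eps: float = 1e-12,      # avoids log(0)
) -> List[float]:
    """Log-linearly combine two distributions over the same vocabulary."""
    scores = [
        math.log(a + eps) + lm_weight * math.log(l + eps)
        for a, l in zip(acoustic, lm)
    ]
    # Softmax with max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_sequence(
    acoustic_seq: List[List[float]],
    lm_seq: List[List[float]],
    lm_weight: float = 1.0,
) -> List[List[float]]:
    """Fuse per-position distributions; assumes both are already label-aligned."""
    return [fuse_position(a, l, lm_weight) for a, l in zip(acoustic_seq, lm_seq)]


# Two label positions over a two-symbol vocabulary (invented numbers).
acoustic_seq = [[0.5, 0.5], [0.6, 0.4]]
lm_seq = [[0.9, 0.1], [0.2, 0.8]]
fused = fuse_sequence(acoustic_seq, lm_seq)
best_labels = [max(range(len(p)), key=p.__getitem__) for p in fused]
```

At the first position the acoustic model is undecided and the language model settles the choice; at the second, the language model overrides a weak acoustic preference. The actual joint-decoding method in the paper may differ substantially from this sketch.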
Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models
Speech Recognition
ASR Rescoring
CTC
USDM
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion language models
speech recognition
uniform-state diffusion models
joint decoding
CTC