🤖 AI Summary
This work aims to improve the language modeling capability and recognition accuracy of automatic speech recognition (ASR) systems during decoding. To this end, it introduces, for the first time, masked diffusion language models (MDLM) and uniform-state diffusion models (USDM) into ASR rescoring. It also proposes a joint decoding framework that integrates CTC frame-level acoustic distributions with USDM token-level language distributions, so that acoustic and linguistic information are combined during decoding. The proposed approach supports parallel text generation and achieves significant improvements in recognition accuracy across multiple benchmarks, demonstrating the efficacy of diffusion-based language models in speech recognition. All code and experimental pipelines are publicly released.
📝 Abstract
Diffusion language models have recently emerged as a leading alternative to standard language models, owing to their bidirectional attention and their capacity for parallel text generation. In this work, we explore several ways of using them in speech recognition. Specifically, we provide a comprehensive guide to using masked diffusion language models (MDLMs) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM: at each decoding step, it integrates the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM, thereby generating new candidates that combine the strong language knowledge of USDM with the acoustic information of CTC. Our findings reveal that both USDM and MDLM can significantly improve the accuracy of the recognized text. We publish all our code and recipes.
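To make the rescoring setting concrete, the following is a minimal sketch of generic n-best rescoring by log-linear interpolation of an ASR score and a language-model score. It is not taken from the released code: the function `rescore_nbest`, the `lm_weight` parameter, and the toy scorer `toy_lm` are hypothetical names introduced for illustration. For a diffusion language model, `lm_score` would typically return an estimate such as a likelihood bound rather than an exact log-probability; how the paper computes that score is not stated in the abstract.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (text, ASR log-score)


def rescore_nbest(
    nbest: List[Hypothesis],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Hypothesis]:
    """Re-rank ASR hypotheses by asr_score + lm_weight * lm_score(text).

    `lm_score` is any callable returning a log-domain score for a full
    hypothesis. Assumption: higher is better for both scores.
    """
    rescored = [
        (text, asr_logp + lm_weight * lm_score(text))
        for text, asr_logp in nbest
    ]
    # Sort best-first by the combined score.
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored


# Toy stand-in for a language model (hypothetical scores, not real data).
_TOY_SCORES = {"recognize speech": -2.0, "wreck a nice beach": -9.0}


def toy_lm(text: str) -> float:
    return _TOY_SCORES[text]


# The ASR model alone slightly prefers the acoustically similar wrong text.
nbest = [("wreck a nice beach", -5.0), ("recognize speech", -5.5)]

best_text, best_score = rescore_nbest(nbest, toy_lm)[0]
print(best_text)  # recognize speech
```

In this toy case the language-model term overturns the acoustic ranking. In practice the interpolation weight would be tuned on a development set.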