Drax: Speech Recognition with Discrete Flow Matching

📅 2025-10-05
🤖 AI Summary
To address the performance bottleneck in non-autoregressive (NAR) automatic speech recognition (ASR) caused by train-inference distribution mismatch, this paper proposes Drax, the first NAR ASR framework based on discrete flow matching. Drax constructs an audio-conditioned probability path that exposes the model during training to token trajectories resembling likely intermediate inference errors, thereby mitigating the distributional shift between training and inference. Theoretically, it links the generalization gap to the divergence between training and inference occupancies, which is controlled by cumulative velocity error, and this motivates the path design. Drax enables fully parallel decoding and achieves recognition accuracy competitive with state-of-the-art autoregressive models on benchmarks including LibriSpeech, while substantially improving decoding efficiency. Extensive experiments validate the effectiveness and scalability of discrete flow matching for ASR tasks.

📝 Abstract
Diffusion and flow-based non-autoregressive (NAR) models have shown strong promise in large language modeling; however, their potential for automatic speech recognition (ASR) remains largely unexplored. We propose Drax, a discrete flow matching framework for ASR that enables efficient parallel decoding. To better align training with inference, we construct an audio-conditioned probability path that guides the model through trajectories resembling likely intermediate inference errors, rather than direct random-noise-to-target transitions. Our theoretical analysis links the generalization gap to divergences between training and inference occupancies, controlled by cumulative velocity errors, thereby motivating our design choice. Empirical evaluation demonstrates that our approach attains recognition accuracy on par with state-of-the-art speech models while offering improved accuracy-efficiency trade-offs, highlighting discrete flow matching as a promising direction for advancing NAR ASR.
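The idea of a path that passes through plausible intermediate errors, rather than jumping straight from noise to target, can be illustrated with a small sketch. It is an illustration under stated assumptions, not the paper's implementation: the three-way mixture, the quadratic weights, and the names `sample_path_state`, `x_src`, `x_mid`, and `x_tgt` are all hypothetical. Here `x_mid` stands in for an audio-conditioned guess that may contain recognition errors.

```python
import random


def sample_path_state(x_src, x_mid, x_tgt, t, rng=random):
    """Sample a training state x_t, token by token, from a three-way mixture.

    Assumed weights (a hypothetical schedule, not taken from the paper):
      source (noise) tokens:      (1 - t)^2
      intermediate (noisy guess): 2 * t * (1 - t)
      target (reference) tokens:  t^2
    The weights sum to 1 for every t in [0, 1].
    """
    weights = [(1 - t) ** 2, 2 * t * (1 - t), t ** 2]
    x_t = []
    for src, mid, tgt in zip(x_src, x_mid, x_tgt):
        # Each position independently picks one of the three candidate tokens.
        x_t.append(rng.choices([src, mid, tgt], weights=weights, k=1)[0])
    return x_t


rng = random.Random(0)
noise = [0, 0, 0, 0]        # e.g. a mask or noise token id
guess = [5, 6, 7, 9]        # hypothetical imperfect transcript (last token wrong)
reference = [5, 6, 7, 8]    # ground-truth token ids
state = sample_path_state(noise, guess, reference, t=0.5, rng=rng)
```

At `t = 0` every position holds the source token and at `t = 1` every position holds the target token; in between, the state mixes noise, plausible errors, and correct tokens, which is the kind of input a model would also encounter partway through inference.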
Problem

Research questions and friction points this paper is trying to address.

Developing discrete flow matching for efficient parallel speech recognition decoding
Addressing training-inference misalignment through audio-conditioned probability paths
Improving accuracy-efficiency trade-offs in non-autoregressive automatic speech recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Discrete flow matching framework for ASR
Audio-conditioned probability path design
Parallel decoding with improved accuracy-efficiency trade-offs
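
Parallel decoding in a discrete flow model can be sketched as a fixed number of refinement steps in which all positions are predicted at once. The sketch below follows a generic Euler-style sampler for a linear mask-to-target schedule and is an assumption for illustration, not the decoder described in the paper; `parallel_decode`, `predict_tokens`, and `MASK` are hypothetical names, and `predict_tokens` stands in for an audio-conditioned model.

```python
import random

MASK = -1  # hypothetical id for the source (mask/noise) token


def parallel_decode(predict_tokens, length, num_steps, rng=random):
    """Refine a fully masked sequence in `num_steps` parallel steps.

    predict_tokens(x, t) is assumed to return one proposed token per
    position, computed for all positions in a single call.
    """
    x = [MASK] * length
    h = 1.0 / num_steps  # step size in flow time
    for step in range(num_steps):
        t = step * h
        proposal = predict_tokens(x, t)
        # Under a linear schedule, a position jumps to its proposal with
        # probability h / (1 - t); the last step always commits.
        jump_prob = 1.0 if step == num_steps - 1 else h / (1.0 - t)
        x = [p if rng.random() < jump_prob else cur
             for cur, p in zip(x, proposal)]
    return x


# Toy usage: an oracle that always proposes the reference transcript.
reference = [3, 1, 4, 1, 5]
decoded = parallel_decode(lambda x, t: list(reference), len(reference),
                          num_steps=4, rng=random.Random(0))
```

The number of model calls equals `num_steps` regardless of sequence length, which is where the accuracy-efficiency trade-off comes from: fewer steps mean faster decoding but fewer chances to revise tokens.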