🤖 AI Summary
Diffusion language models suffer from inefficient inference due to the quadratic computational complexity of Transformer self-attention and the memory overhead of KV caching. To address this, we propose DiffuApriel, the first masked diffusion language model built on a bidirectional Mamba backbone, introducing linear-complexity state-space modeling into diffusion-based text generation and thereby alleviating long-sequence inference bottlenecks. We further design DiffuApriel-H, a hybrid architecture that interleaves attention and Mamba layers to jointly capture global semantics and local dependencies. At the 1.3B-parameter scale, DiffuApriel achieves up to a 4.4× improvement in inference throughput (and DiffuApriel-H up to 2.6×) with substantially lower memory consumption, while maintaining generation quality comparable to Transformer baselines. This work establishes the first effective coupling of Mamba architectures with diffusion mechanisms in language modeling, offering a new paradigm for efficient non-autoregressive text generation.
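The hybrid design can be pictured as a simple layer schedule: mostly Mamba layers for linear-time local modeling, with an attention layer inserted at a fixed interval for global context. The function name and the one-attention-layer-every-N pattern below are illustrative assumptions, not the paper's exact configuration:

```python
def hybrid_layer_plan(depth: int, attn_every: int = 4) -> list[str]:
    """Illustrative schedule for a hybrid stack (assumed pattern):
    one attention layer every `attn_every` layers, Mamba elsewhere."""
    return ["attention" if (i + 1) % attn_every == 0 else "mamba"
            for i in range(depth)]

# e.g. an 8-layer stack with attention at layers 4 and 8
plan = hybrid_layer_plan(8, attn_every=4)
```

Keeping attention layers sparse preserves most of the throughput advantage of the state-space layers while still giving the model periodic all-to-all token mixing.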
📝 Abstract
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4× higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6× throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
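The forward (noising) side of a masked diffusion objective can be sketched in a few lines: at noise level t, each token is independently replaced by a mask token, and the denoiser is trained to recover the masked positions. The mask-token id and function below are illustrative assumptions about the general masked-diffusion recipe, not DiffuApriel's exact implementation:

```python
import random

MASK_ID = -1  # hypothetical mask-token id for illustration


def mask_tokens(tokens: list[int], t: float, rng: random.Random) -> list[int]:
    """Forward step of masked diffusion: each token is independently
    replaced by MASK_ID with probability t (the noise level)."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
seq = list(range(10))
noised = mask_tokens(seq, t=0.5, rng=rng)  # roughly half the tokens masked
```

Because every unmasked token can condition on context in both directions, the denoiser benefits from a bidirectional backbone, which is what motivates the bidirectional Mamba design here.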