Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models suffer from inefficient inference due to the quadratic computational complexity of Transformer self-attention and the memory overhead of KV caching. To address this, we propose DiffuApriel, the first masked diffusion language model built on a bidirectional Mamba backbone, introducing linear-complexity state-space modeling into diffusion-based text generation and relieving the long-sequence inference bottleneck. We further design DiffuApriel-H, a hybrid architecture that interleaves attention and Mamba layers to capture both global semantics and local dependencies. At the 1.3B-parameter scale, DiffuApriel achieves up to a 4.4× improvement in inference throughput (up to 2.6× for the hybrid DiffuApriel-H) and significantly reduces memory consumption, while maintaining generation quality comparable to Transformer baselines. This work establishes the first effective coupling of Mamba architectures with diffusion mechanisms in language modeling, offering a practical paradigm for efficient non-autoregressive text generation.

📝 Abstract
Diffusion-based language models have recently emerged as a promising alternative to autoregressive generation, yet their reliance on Transformer backbones limits inference efficiency due to quadratic attention and KV-cache overhead. In this work, we introduce DiffuApriel, a masked diffusion language model built on a bidirectional Mamba backbone that combines the diffusion objective with linear-time sequence modeling. DiffuApriel matches the performance of Transformer-based diffusion models while achieving up to 4.4x higher inference throughput for long sequences with a 1.3B model. We further propose DiffuApriel-H, a hybrid variant that interleaves attention and Mamba layers, offering up to 2.6x throughput improvement with balanced global and local context modeling. Our results demonstrate that bidirectional state-space architectures serve as strong denoisers in masked diffusion LMs, providing a practical and scalable foundation for faster, memory-efficient text generation.
Problem

Research questions and friction points this paper is trying to address.

Replacing Transformer backbones to overcome quadratic attention bottlenecks
Improving inference throughput for long-sequence diffusion language models
Balancing global-local context modeling with hybrid architecture designs
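The quadratic-vs-linear framing above can be made concrete with a back-of-the-envelope cost model (an illustrative assumption, not figures from the paper): per layer, self-attention scales roughly as O(L² · d) in sequence length L, while a Mamba-style state-space scan scales as O(L · d).

```python
def attention_cost(seq_len: int, d_model: int) -> int:
    # Rough per-layer FLOP proxy for self-attention: every token
    # attends to every other token.
    return seq_len * seq_len * d_model

def ssm_cost(seq_len: int, d_model: int) -> int:
    # Rough per-layer FLOP proxy for a linear-time state-space scan.
    return seq_len * d_model

# Doubling the sequence length quadruples the attention proxy but only
# doubles the SSM proxy, so the ratio grows linearly with length.
for L in (1024, 2048, 4096):
    print(L, attention_cost(L, 2048) // ssm_cost(L, 2048))
# → 1024 1024 / 2048 2048 / 4096 4096
```

Under this toy model the attention-to-SSM ratio equals the sequence length itself, which is why the throughput gap the paper reports widens for long sequences.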
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses bidirectional Mamba backbone for diffusion
Combines diffusion objective with linear-time modeling
Hybrid variant interleaves attention and Mamba layers
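The masked-diffusion sampling loop these contributions plug into can be sketched as follows. This is a minimal, hedged illustration of iterative unmasking, not the paper's algorithm: `toy_denoiser` is a hypothetical stand-in for the bidirectional Mamba denoiser, and the uniform unmasking schedule is an assumption for clarity.

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    # Hypothetical stand-in for the bidirectional denoiser: fills every
    # masked position with a fixed token. In DiffuApriel this would be a
    # full bidirectional Mamba (or hybrid attention/Mamba) model.
    return ["the" if t == MASK else t for t in tokens]

def masked_diffusion_sample(length, steps, denoiser, rng):
    # Start from an all-masked sequence and, at each step, reveal an
    # equal share of the remaining masked positions using the
    # denoiser's current predictions.
    tokens = [MASK] * length
    masked = list(range(length))
    for step in range(steps):
        preds = denoiser(tokens)
        k = max(1, len(masked) // (steps - step))
        rng.shuffle(masked)
        reveal, masked = masked[:k], masked[k:]
        for i in reveal:
            tokens[i] = preds[i]
    return tokens

out = masked_diffusion_sample(8, 4, toy_denoiser, random.Random(0))
print(out)  # every position unmasked after the final step
```

Because the denoiser conditions on the whole sequence at every step, the backbone must be bidirectional, which is why a standard causal Mamba cannot be dropped in unmodified.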