Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the low generation efficiency and limited theoretical grounding of recurrent-depth language models (RDLMs). It establishes, for the first time, a formal theoretical connection between RDLMs and diffusion language models, and proposes a fine-tuning-free, plug-and-play diffusion-forcing sampler. The method reformulates text generation as iterative refinement of latent states: a new token is decoded at every forward pass of the model, while the latent states of earlier tokens continue to be refined in parallel through recurrence. Applied to an existing 3.5B-parameter RDLM, the sampler achieves up to a 5x inference speedup, and generation with it is strictly more expressive than autoregressive generation under the same time budget. Beyond this acceleration, the work offers a new theoretical reading of RDLMs through the lens of diffusion probabilistic modeling, unifying sequential generation with iterative denoising and shedding light on the latent dynamics of recurrent architectures.

📝 Abstract
Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have demonstrated that these architectures can scale to modern language modeling tasks while exhibiting advantages in reasoning tasks. In this work, we examine the relationship between recurrent-depth models and diffusion language models. Building on their similarities, we develop a new diffusion forcing sampler for these models to accelerate generation. The sampler advances by decoding new tokens at every forward pass of the model, while the latent states of these tokens can be further refined in parallel through recurrence. Theoretically, generation with our sampler is strictly more expressive than the baseline autoregressive generation using the same time budget on modern hardware. Moreover, this sampler, based on principles from diffusion literature, can be directly applied to existing 3.5B recurrent-depth transformers without any tuning, leading to up to a 5x speedup. Consequently, our findings not only provide an efficient mechanism for parallelizing the extra computation in recurrent-depth models at inference, but also suggest that such models can be naturally viewed as strong continuous, though causal, diffusion language models.
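The sampler described above can be illustrated with a toy sketch. This is not the paper's implementation; the latent width, depth, and the names `recur_step`, `decode`, and `diffusion_forcing_sample` are all hypothetical. It shows the staggered schedule only: every iteration injects one fresh token latent and refines all in-flight latents with a single batched recurrence call, so positions sit at decreasing refinement depths and a token is emitted once its latent reaches full depth.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # toy latent width
MAX_DEPTH = 4   # recurrence steps before a token counts as fully refined

def recur_step(states, W):
    # One shared recurrent block applied to all active latents at once;
    # this single batched call is what parallelizes refinement across positions.
    return np.tanh(states @ W)

def decode(state):
    # Hypothetical readout: argmax over a tiny toy "vocabulary" slice.
    return int(np.argmax(state[:8]))

def diffusion_forcing_sample(prompt_latent, n_tokens, W):
    """Staggered sampler: each forward pass refines every in-flight latent
    by one recurrence step AND injects one new token latent, forming a
    diagonal of refinement depths across positions."""
    in_flight = []            # [depth, state] pairs, oldest first
    out = []
    next_init = prompt_latent
    while len(out) < n_tokens:
        in_flight.append([0, next_init])
        states = recur_step(np.stack([s for _, s in in_flight]), W)
        for i, pair in enumerate(in_flight):
            pair[0] += 1
            pair[1] = states[i]
        # the oldest latent reaches full depth first and is decoded
        if in_flight[0][0] >= MAX_DEPTH:
            _, state = in_flight.pop(0)
            out.append(decode(state))
            next_init = state          # condition the next token (toy choice)
        else:
            next_init = states[-1]
    return out

W = rng.standard_normal((D, D)) / np.sqrt(D)
print(diffusion_forcing_sample(rng.standard_normal(D), 5, W))
```

With MAX_DEPTH = 4, five tokens take eight forward passes here instead of the twenty a fully sequential schedule would need, which is the source of the speedup the paper reports at scale.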
Problem

Research questions and friction points this paper is trying to address.

Recurrent-depth models spend extra sequential computation per token, making generation slow
This extra recurrence is not parallelized by standard autoregressive sampling at inference time
The relationship between recurrent-depth models and diffusion language models lacks a formal account
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops a diffusion forcing sampler for recurrent-depth models
Decodes a new token at every forward pass while refining earlier latents in parallel through recurrence
Achieves up to a 5x speedup on an existing 3.5B recurrent-depth transformer without any tuning