Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

📅 2026-05-07
📈 Citations: 0
Influential: 0

🤖 AI Summary
Diffusion language models have long trailed autoregressive models in generation quality and diversity, and their output layers scale with vocabulary size. This work proposes a bitstream diffusion framework that represents text as fixed-width continuous (analog) binary bitstreams, encoding each token as a semantic bit sequence. A matched-filter residual parameterization separates contextual learning from the analytic independent-bit posteriors, and an entropy-rate-gated stochastic sampler concentrates Langevin-type corrections in high-information regions while staying nearly deterministic elsewhere. Because the model predicts $\mathcal{O}(\log V)$ bitwise logits instead of $\mathcal{O}(V)$ vocabulary logits, it reduces memory footprint and increases throughput. On LM1B, the 130M-parameter model reaches a GenPPL of 59.76 at matched real-data entropy (4.31) with 256 NFEs, reaching the autoregressive reference. On OWT, it achieves a GenPPL of 27.06 at an entropy of 5.26 using one-quarter of the steps of previous 1024-NFE baselines.
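To make the fixed-width bitstream representation concrete, the following is a minimal sketch of mapping token ids to analog bits and back. It assumes a plain binary index code with bits mapped to -1.0 and +1.0; the paper's semantic bit assignment is not described on this page, so the helper names (`bit_width`, `token_to_analog_bits`, `analog_bits_to_token`) and the encoding itself are illustrative assumptions, not the authors' implementation.

```python
def bit_width(vocab_size: int) -> int:
    """Smallest number of bits that can index `vocab_size` distinct tokens."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_analog_bits(token_id: int, width: int) -> list[float]:
    """Encode a token id as `width` analog bits, MSB first (0 -> -1.0, 1 -> +1.0).

    Assumption: plain binary index code, not the paper's semantic code.
    """
    return [
        1.0 if (token_id >> (width - 1 - i)) & 1 else -1.0
        for i in range(width)
    ]


def analog_bits_to_token(bits: list[float]) -> int:
    """Decode analog bits back to a token id by thresholding each bit at zero."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | (1 if b > 0.0 else 0)
    return token_id


# A GPT-2-sized vocabulary (50257 entries) needs only 16 bitwise outputs
# per position instead of 50257 vocabulary logits.
width = bit_width(50257)
bits = token_to_analog_bits(12345, width)
# Small perturbations of the continuous bits still decode to the same token.
noisy = [b + 0.3 if i % 2 else b - 0.3 for i, b in enumerate(bits)]
recovered = analog_bits_to_token(noisy)
```

Thresholding at zero is what lets a continuous diffusion state be read out as discrete tokens, and the per-position output count grows with $\log_2 V$ rather than $V$.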
📝 Abstract
Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity (GenPPL) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\mathrm{GenPPL}=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
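The "analytic independent-bit posterior" mentioned in the abstract can be illustrated with a standard result: for a bit $x \in \{-1, +1\}$ with a uniform prior observed as $z = \alpha x + \sigma \epsilon$ with Gaussian $\epsilon$, the log-odds of $x = +1$ are $2\alpha z / \sigma^2$ and the posterior mean is $\tanh(\alpha z / \sigma^2)$. The sketch below computes that closed form and adds a learned correction on the logit scale. The additive `residual_logit` is only one plausible reading of a residual parameterization and is an assumption here, as are the function names and the noise model; the page does not specify the authors' exact formulation.

```python
import math


def analytic_bit_logit(z: float, alpha: float, sigma: float) -> float:
    """Log-odds of x=+1 versus x=-1 given z = alpha*x + sigma*eps, uniform prior."""
    return 2.0 * alpha * z / (sigma ** 2)


def bit_posterior_mean(
    z: float, alpha: float, sigma: float, residual_logit: float = 0.0
) -> float:
    """Posterior mean E[x | z] for a +/-1 bit.

    `residual_logit` stands in for a context-dependent network output
    (hypothetical); with residual_logit=0 this is the closed-form
    independent-bit posterior tanh(alpha * z / sigma**2).
    """
    logit = analytic_bit_logit(z, alpha, sigma) + residual_logit
    # E[x] = 2 * sigmoid(logit) - 1 = tanh(logit / 2)
    return math.tanh(0.5 * logit)


# With no contextual correction, an ambiguous observation stays at zero;
# a positive residual pushes the estimate toward +1.
independent = bit_posterior_mean(0.0, alpha=1.0, sigma=1.0)
with_context = bit_posterior_mean(0.0, alpha=1.0, sigma=1.0, residual_logit=2.0)
```

Under this reading, the network only has to model what context adds beyond the per-bit evidence already present in the noisy observation.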
Problem

Research questions and friction points this paper is trying to address.

diffusion language models
autoregressive gap
sample quality
generation diversity
language modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

bitstream diffusion
entropy-gated sampling
continuous language modeling
vocabulary scalability
generative perplexity