Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion-based language models have long trailed autoregressive models in generation quality and diversity, and their output layers scale with vocabulary size. This work proposes a bitstream diffusion framework that represents text as fixed-width continuous (analog) bitstreams, encoding each token as a semantic bit sequence. A matched-filter residual parameterization separates contextual learning from the analytic independent-bit posteriors, and an entropy-rate-gated stochastic sampler applies Langevin-type corrections mainly in high-information regions while staying nearly deterministic elsewhere. Because the model predicts O(log V) bitwise logits instead of O(V) token logits, it avoids the vocabulary-size bottleneck of standard diffusion language models, reducing memory footprint and increasing throughput. On LM1B, the 130M-parameter model reaches a GenPPL of 59.76 at matched real-data entropy (4.31) with 256 NFEs, outperforming prior DLM baselines and reaching the autoregressive reference. On OWT, it achieves a GenPPL of 27.06 at an entropy of 5.26 using 4× fewer steps than previous 1024-NFE baselines.
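To make the fixed-width bitstream representation concrete, the following minimal sketch maps a token id to analog bits in {-1, +1} and back. It assumes a plain binary index code; the paper's "semantic" bit assignment is not described here, so this code is only a stand-in. The function names and the vocabulary size of 50,257 are illustrative assumptions, not taken from the paper.

```python
def bit_width(vocab_size: int) -> int:
    """Smallest number of bits that can index `vocab_size` distinct tokens."""
    return max(1, (vocab_size - 1).bit_length())


def token_to_analog_bits(token_id: int, width: int) -> list[float]:
    """Encode a token id as `width` analog bits in {-1.0, +1.0}, MSB first.

    Assumption: plain binary index code, not the paper's semantic code.
    """
    return [1.0 if (token_id >> i) & 1 else -1.0 for i in reversed(range(width))]


def analog_bits_to_token(bits: list[float]) -> int:
    """Decode by thresholding each (possibly noisy) analog bit at zero."""
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | (1 if b > 0.0 else 0)
    return token_id


# Hypothetical vocabulary size, chosen only for illustration.
VOCAB_SIZE = 50_257
WIDTH = bit_width(VOCAB_SIZE)  # bits per token, versus VOCAB_SIZE logits per token

bits = token_to_analog_bits(12_345, WIDTH)
# Mild perturbation standing in for a partially denoised state.
noisy = [b + 0.3 * (-1) ** k for k, b in enumerate(bits)]
recovered = analog_bits_to_token(noisy)
```

Under this assumed code, the output head needs one logit per bit, so its size grows with the logarithm of the vocabulary size instead of linearly.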

📝 Abstract

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity ($\mathrm{GenPPL}$) of $59.76$ at matched real-data entropy ($4.31$) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving $\mathrm{GenPPL}=27.06$ at an entropy of $5.26$ using $4\times$ fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the $\mathcal{O}(V)$ vocabulary scaling bottleneck shared by standard DLMs. By predicting $\mathcal{O}(\log V)$ bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.
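The "analytic independent-bit posterior" has a closed form under a simple noise model, sketched below. The sketch assumes each clean bit x in {-1, +1} has a uniform prior and is observed as z = alpha * x + sigma * eps with standard Gaussian eps; the log-odds of x = +1 are then 2 * alpha * z / sigma^2 and the posterior mean is tanh(alpha * z / sigma^2). The additive `context_residual` term is one plausible reading of "residual parameterization" and is an assumption; the abstract does not spell out the paper's exact matched-filter form, and all names here are hypothetical.

```python
import math


def independent_bit_logit(z: float, alpha: float, sigma: float) -> float:
    """Log-odds that the clean bit is +1 given z = alpha * x + sigma * eps.

    Assumes x is uniform on {-1, +1} and eps is standard Gaussian.
    """
    return 2.0 * alpha * z / (sigma ** 2)


def posterior_mean(logit: float) -> float:
    """E[x | z] for x in {-1, +1}: 2 * sigmoid(logit) - 1 = tanh(logit / 2)."""
    return math.tanh(logit / 2.0)


def residual_bit_logit(z: float, alpha: float, sigma: float,
                       context_residual: float) -> float:
    """Analytic per-bit logit plus a learned contextual correction.

    `context_residual` stands in for a network output (hypothetical); with a
    zero residual the prediction reduces to the independent-bit posterior.
    """
    return independent_bit_logit(z, alpha, sigma) + context_residual


# A noisy observation that leans toward +1.
analytic = independent_bit_logit(z=0.5, alpha=1.0, sigma=1.0)
mean_without_context = posterior_mean(analytic)
mean_with_context = posterior_mean(residual_bit_logit(0.5, 1.0, 1.0, -3.0))
```

In this reading, the network only has to learn what context adds beyond the per-bit evidence, since the noise-dependent part of the prediction is computed analytically.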

Problem

Research questions and friction points this paper is trying to address.

diffusion language models

autoregressive gap

sample quality

generation diversity

language modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

bitstream diffusion

entropy-gated sampling

continuous language modeling