BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
Traditional autoregressive language models are constrained by their token-by-token generation paradigm, which hinders efficient modeling of multi-token semantic units and limits both expressive capacity and inference speed. This work proposes a novel parallel multi-token generation architecture based on bit-level continuous diffusion: each token is encoded into a fixed-length binary representation, and within causal attention blocks, a lightweight diffusion head enables parallel denoising of multiple tokens while preserving inter-block autoregressive dependencies. By integrating the reliability of autoregressive modeling with the efficiency of iterative parallel generation, the method substantially improves pretraining efficiency and inference throughput. The results demonstrate that single-token sequential generation is not an inherent limitation of language models, offering a promising pathway toward building stronger and faster next-generation architectures.
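The combination of causal dependencies between blocks with joint decisions inside a block can be pictured as a block-wise attention mask. The following is only a rough sketch under stated assumptions: the function name `block_causal_mask`, the fixed `block_size`, and the choice to let a position see every position in its own block are all illustrative and are not taken from the paper, which may arrange attention differently.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Build a boolean attention mask (hypothetical helper, not from the paper).

    mask[i][j] is True when query position i may attend to key position j.
    Assumption: a position sees its own block and every earlier block,
    so ordering is causal across blocks but unrestricted inside a block.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Four positions split into two blocks of two tokens each.
    for row in block_causal_mask(seq_len=4, block_size=2):
        print("".join("x" if allowed else "." for allowed in row))
```

With `block_size=1` this reduces to the ordinary token-level causal mask, which is one way to see token-by-token generation as a special case of the block-wise setting.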
📝 Abstract
Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
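To make the "fixed-length binary code" idea concrete, here is a minimal sketch of mapping token ids to bit vectors and back. It assumes the simplest possible code, the plain binary expansion of the token index; the abstract does not say how BitLM actually assigns codes, and the helper names (`code_length`, `token_to_bits`, `bits_to_token`, `commit`) and the vocabulary size of 50,257 are illustrative choices only.

```python
def code_length(vocab_size: int) -> int:
    """Smallest number of bits that gives every token id a distinct code."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return max(1, (vocab_size - 1).bit_length())


def token_to_bits(token_id: int, n_bits: int) -> list[int]:
    """Encode a token id as n_bits binary digits, most significant bit first."""
    if not 0 <= token_id < 2 ** n_bits:
        raise ValueError("token_id does not fit in n_bits")
    return [(token_id >> shift) & 1 for shift in reversed(range(n_bits))]


def bits_to_token(bits: list[int]) -> int:
    """Decode a most-significant-bit-first bit list back to a token id."""
    token_id = 0
    for bit in bits:
        token_id = (token_id << 1) | bit
    return token_id


def commit(soft_bits: list[float]) -> list[int]:
    """Threshold continuous values in [0, 1] to hard bits.

    Assumption: this stands in for the final step of a denoising process;
    the paper's actual commitment rule is not described in the abstract.
    """
    return [1 if value >= 0.5 else 0 for value in soft_bits]


if __name__ == "__main__":
    n_bits = code_length(50_257)  # illustrative vocabulary size
    bits = token_to_bits(1234, n_bits)
    print(n_bits, bits, bits_to_token(bits))
```

Under this assumed encoding, a vocabulary of 50,257 tokens needs 16 bits per token, so the output layer predicts 16 values per token instead of one logit per vocabulary entry. That size difference is the intuition behind replacing the large-vocabulary softmax with bitwise denoising.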
Problem

Research questions and friction points this paper is trying to address.

autoregressive language models
multi-token generation
token bottleneck
language modeling
inference throughput
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bitwise Diffusion
Multi-Token Generation
Binary Token Representation
Parallel Language Modeling
Causal Attention