Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion language models (DLMs) suffer from fragmented KV caching, poor memory locality, and frequent corrections of unstable token boundaries during inference due to decentralized acceptance strategies, severely limiting generation efficiency. This work proposes the Longest Stable Prefix (LSP) scheduler, which introduces—for the first time—a training- and model-agnostic paradigm for contiguous prefix absorption. By dynamically evaluating token stability in a single forward pass, LSP atomically commits a left-aligned, contiguous stable prefix, thereby restructuring the commitment topology. This approach transforms fragmented cache updates into continuous appends while preserving bidirectional lookahead capability, significantly reducing token flip rates and denoising call frequency. Evaluated on LLaDA-8B and Dream-7B, LSP achieves up to 3.4× speedup across diverse tasks—including mathematics, code generation, multilingual (Chinese, Japanese, Korean) text, and creative writing—while maintaining or slightly improving output quality.

📝 Abstract
Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance': committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
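The commitment rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the confidence-threshold stability test, and the delimiter set are all assumptions; the actual stability criterion and boundary-snapping heuristic are defined in the paper.

```python
# Hypothetical sketch of the LSP commitment rule: find the longest
# left-aligned run of stable tokens, snap its boundary back to a natural
# delimiter, and commit it atomically as a contiguous KV-cache append.

def longest_stable_prefix(stability, tokens, delimiters=(" ", "\n", ".", ",")):
    """Length of the left-aligned contiguous stable block, with its
    boundary snapped to the last delimiter inside that block."""
    # 1. Longest run of stable predictions starting at position 0.
    n = 0
    while n < len(stability) and stability[n]:
        n += 1
    if n == 0:
        return 0
    # 2. Snap the boundary to the last delimiter within the stable run,
    #    so the committed prefix ends on a linguistic/structural boundary.
    for i in range(n - 1, -1, -1):
        if tokens[i] in delimiters:
            return i + 1
    return n  # no delimiter found: commit the whole stable run


def lsp_decode_step(kv_cache, active_suffix, denoise, is_stable):
    """One denoising step: a single forward pass scores every position,
    then the stable prefix is absorbed and the active suffix shrinks."""
    preds, conf = denoise(active_suffix)       # single forward pass
    stability = [is_stable(c) for c in conf]   # e.g. confidence > threshold
    k = longest_stable_prefix(stability, preds)
    kv_cache.extend(preds[:k])                 # contiguous append, no holes
    return preds[k:]                           # remaining active suffix
```

Because commits are always a left-aligned prefix, the cache grows only by appends, which is what converts the fragmented updates of scattered acceptance into the memory-local pattern the paper targets.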
Problem

Research questions and friction points this paper is trying to address.

Diffusion Language Models
inference speed
scattered acceptance
KV cache fragmentation
decoding scheduler
Innovation

Methods, ideas, or system contributions that make the work stand out.

Longest Stable Prefix
Diffusion Language Models
KV cache efficiency
contiguous decoding
training-free inference