BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a granularity trade-off in block-wise diffusion language model (dLLM) inference: small blocks preserve local dependencies but require more denoising steps, while large blocks expose more parallelism but risk premature commitments that accumulate KV-cache error. The authors propose BlockBatch, a training-free, online inference framework that treats block size as a branching dimension. It runs several block-size branches of the same request in a single batched forward pass, building on the observation that branches tend to diverge at semantically decisive positions and agree on syntactically lightweight tokens. The branches are coordinated through confidence-gated token merging, leader-based synchronization, and periodic full-sequence KV-cache refreshes. Evaluated on three dLLMs and four datasets, BlockBatch reduces the number of denoising function evaluations (NFEs) by 26.6% on average and achieves a 1.33× average end-to-end speedup over Fast-dLLM while preserving accuracy.
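To make "block size as a branching dimension" concrete, here is a minimal sketch of how one request could be partitioned under several block sizes at once. The helper names `block_spans` and `branch_layouts`, the sequence length of 20, and the block sizes 4, 8, and 16 are illustrative assumptions, not values or APIs from the paper.

```python
def block_spans(seq_len: int, block_size: int) -> list[tuple[int, int]]:
    """Split positions [0, seq_len) into consecutive half-open blocks."""
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        (start, min(start + block_size, seq_len))
        for start in range(0, seq_len, block_size)
    ]


def branch_layouts(seq_len: int, block_sizes: tuple[int, ...]) -> dict[int, list[tuple[int, int]]]:
    """Build one block layout per branch, keyed by block size (hypothetical helper)."""
    return {size: block_spans(seq_len, size) for size in block_sizes}


# Illustrative values only: a 20-token generation window and three branches.
layouts = branch_layouts(seq_len=20, block_sizes=(4, 8, 16))
for size, spans in layouts.items():
    print(f"block_size={size}: {len(spans)} blocks -> {spans}")
```

Smaller block sizes produce more blocks, and therefore more sequential block updates, over the same positions. Larger ones cover the window in fewer blocks but commit more tokens per block. This is the trade-off the summary describes.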

📝 Abstract

Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6% on average and achieves a 1.33× average end-to-end speedup over Fast-dLLM while preserving accuracy. These results identify block-size diversity as a practical and previously underexplored axis for branch-parallel dLLM inference.
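The abstract names confidence-gated token merging but does not spell out the rule, so the following is only a toy sketch under an assumed rule: at each position, take the proposal from the most confident branch and commit it only if that confidence clears a threshold; otherwise the position stays masked. `BranchProposal`, `merge_proposals`, `disagreement_positions`, the threshold of 0.8, and all token and confidence values are hypothetical. Leader-based synchronization and KV-cache refreshes are not modeled.

```python
from dataclasses import dataclass
from typing import Optional

MASK = None  # stand-in for a position that remains masked after merging


@dataclass
class BranchProposal:
    """Per-position output of one block-size branch (hypothetical structure)."""
    block_size: int
    tokens: list[str]
    confidence: list[float]  # assumed to lie in [0, 1]


def merge_proposals(branches: list[BranchProposal], threshold: float) -> list[Optional[str]]:
    """Assumed rule: keep the most confident proposal per position if it clears the gate."""
    if not branches:
        raise ValueError("need at least one branch")
    seq_len = len(branches[0].tokens)
    merged: list[Optional[str]] = []
    for pos in range(seq_len):
        best = max(branches, key=lambda b: b.confidence[pos])
        merged.append(best.tokens[pos] if best.confidence[pos] >= threshold else MASK)
    return merged


def disagreement_positions(branches: list[BranchProposal]) -> list[int]:
    """Positions where the branches propose different tokens."""
    seq_len = len(branches[0].tokens)
    return [p for p in range(seq_len) if len({b.tokens[p] for b in branches}) > 1]


# Made-up proposals: the branches agree on function words and punctuation
# and split on the content word at position 1.
small = BranchProposal(4, ["the", "cat", "sat", "."], [0.95, 0.60, 0.90, 0.99])
large = BranchProposal(8, ["the", "dog", "sat", "."], [0.97, 0.55, 0.85, 0.98])

merged = merge_proposals([small, large], threshold=0.8)
split_at = disagreement_positions([small, large])
print(merged, split_at)
```

In this toy input the only disagreement falls on the content word, and neither branch is confident there, so the gate leaves that position masked for a later denoising step while the agreed tokens are committed.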

Problem

Research questions and friction points this paper is trying to address.

diffusion language models

block size

inference efficiency

KV-cache

granularity trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

BlockBatch

diffusion language models

multi-scale decoding