🤖 AI Summary
Diffusion language models (DLMs) suffer from low inference efficiency due to their multi-step refinement process and incompatibility with key-value (KV) caching. This paper introduces CDLM, the first training-based acceleration method for DLMs that fully supports standard KV caching. CDLM enables multi-token parallel generation via consistency modeling and introduces a block-wise causal attention mechanism, preserving strict causality while achieving full compatibility with conventional KV caches. Built upon standard Transformer architectures, CDLM requires only lightweight fine-tuning, incorporating a consistency diffusion objective, block-wise causal masking, and parallel decoding strategies. On mathematical reasoning and code generation benchmarks, CDLM reduces inference latency by 3.6x-14.5x over baseline DLMs, maintaining accuracy on par with the original models. To our knowledge, CDLM is the first approach to unify efficient KV cache utilization with high-quality, few-step generation in diffusion-based language modeling.
📄 Abstract
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
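The block-wise causal attention described above allows full attention within a block while keeping attention across blocks strictly causal, which is what makes the per-block KV cache reusable. A minimal sketch of such a mask in NumPy (the function name and formulation are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: position i may attend to position j
    iff j's block index <= i's block index, i.e. full attention
    within a block and causal attention across blocks."""
    block_ids = np.arange(seq_len) // block_size
    # True = attention allowed, False = masked out
    return block_ids[None, :] <= block_ids[:, None]

# Example: 6 tokens in blocks of 2.
# Tokens 0-1 see only block 0; tokens 2-3 see blocks 0-1; etc.
mask = block_causal_mask(seq_len=6, block_size=2)
```

Because a finalized block's keys and values can no longer be attended to differently by later blocks, their KV entries can be cached exactly as in a standard autoregressive Transformer.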