🤖 AI Summary
Diffusion language models (DLMs) suffer from low inference efficiency due to their multi-step refinement process and incompatibility with key-value (KV) caching. This paper introduces CDLM, the first training-based acceleration method for DLMs that fully supports standard KV caching. CDLM enables multi-token parallel generation via consistency modeling and introduces a block-wise causal attention mechanism, preserving strict causality while achieving full compatibility with conventional KV caches. Built upon standard Transformer architectures, CDLM requires only lightweight fine-tuning, incorporating a consistency diffusion objective, block-wise causal masking, and parallel decoding strategies. On mathematical reasoning and code generation benchmarks, CDLM reduces inference latency by 3.6x-14.5x over baseline DLMs, maintaining accuracy on par with the original models. To our knowledge, CDLM is the first approach to unify efficient KV cache utilization with high-quality, few-step generation in diffusion-based language modeling.
📄 Abstract
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6x-14.5x lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
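The block-wise causal attention described above allows full attention within a block while keeping attention across blocks strictly causal, which is what makes the per-block KV cache reusable. A minimal sketch of such a mask in NumPy (the function name and formulation are illustrative, not taken from the paper's implementation):

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean attention mask: position i may attend to position j
    iff j's block index <= i's block index, i.e. full attention
    within a block and causal attention across blocks."""
    block_ids = np.arange(seq_len) // block_size
    # True = attention allowed, False = masked out
    return block_ids[None, :] <= block_ids[:, None]

# Example: 6 tokens in blocks of 2.
# Tokens 0-1 see only block 0; tokens 2-3 see blocks 0-1; etc.
mask = block_causal_mask(seq_len=6, block_size=2)
```

Because a finalized block's keys and values can no longer be attended to differently by later blocks, their KV entries can be cached exactly as in a standard autoregressive Transformer.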