🤖 AI Summary
Diffusion language models (dLLMs) suffer from low inference throughput and typically require additional training to be accelerated. To address this, we propose CadLLM, a training-free, model-agnostic, lightweight adaptive inference acceleration method. Its core innovation is a confidence-aware mechanism, introduced here for the first time, that dynamically analyzes the confidence distribution of unmasked tokens to adaptively adjust the generation block size, sampling step size, and confidence threshold; it further incorporates dynamic vocabulary-subset selection to reduce softmax overhead. CadLLM is plug-and-play and fully compatible with KV-caching architectures. Evaluated on four mainstream tasks, CadLLM achieves up to a 2.28× throughput improvement over baseline dLLMs while maintaining generation accuracy comparable to the best-performing baselines.
📝 Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.