🤖 AI Summary
Diffusion language models (dLLMs) suffer from low inference throughput and typically require additional training to be accelerated. To address this, we propose CadLLM, a training-free, model-agnostic, lightweight adaptive inference acceleration method. Its core innovation is a confidence-aware mechanism, introduced here for the first time, that dynamically analyzes the confidence distribution of unmasked tokens to adaptively adjust the generation block size, sampling step size, and confidence threshold; it further incorporates dynamic vocabulary-subset selection to reduce softmax overhead. CadLLM is plug-and-play and fully compatible with KV-caching architectures. Evaluated on four mainstream tasks, CadLLM achieves up to a 2.28× throughput improvement over baseline dLLMs while maintaining generation accuracy comparable to the best-performing baselines.
📝 Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.