Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration

📅 2025-12-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion language models (dLLMs) suffer from low inference throughput, and existing acceleration methods typically require additional training. To address this, we propose CadLLM, a training-free, model-agnostic, lightweight adaptive inference-acceleration method. Its core innovation is a confidence-aware mechanism, the first of its kind, that dynamically analyzes the confidence distribution over unmasked tokens to adaptively adjust the generation block size, sampling step size, and confidence thresholds; it further selects a dynamic vocabulary subset to reduce softmax overhead. CadLLM is plug-and-play and fully compatible with KV-caching architectures. Evaluated on four mainstream tasks, CadLLM achieves up to a 2.28× throughput improvement over baseline dLLMs while preserving generation accuracy comparable to the best-performing baselines.

📝 Abstract
We present CadLLM, a training-free method to accelerate the inference throughput of diffusion-based LLMs (dLLMs). We first investigate the dynamic nature of token unmasking confidence across blocks and steps. Based on this observation, we present a lightweight adaptive approach that controls the generation block size, step size, and threshold based on the average confidence of unmasked tokens. We further reduce softmax overhead by dynamically leveraging a subset of the vocabulary to regulate sampling breadth. CadLLM is a plug-and-play, model-agnostic method compatible with KV-cache-based dLLMs. Extensive experiments on four popular tasks demonstrate that CadLLM yields up to 2.28x throughput improvement over the state-of-the-art baseline with competitive accuracy.
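The abstract's adaptive control idea can be sketched as a simple feedback rule: after each denoising step, the mean confidence of the tokens just unmasked nudges the block size, step size, and threshold up or down. The function below is a minimal illustrative sketch under assumed bounds and update rules; the function name, parameters, and limits (`lo`, `hi`, the caps of 128/16, etc.) are hypothetical and not taken from the paper.

```python
def adapt_schedule(confidences, block_size, step_size, threshold,
                   lo=0.6, hi=0.9):
    """Adjust generation hyperparameters from the mean confidence of the
    tokens unmasked in the previous step (illustrative sketch only)."""
    mean_conf = sum(confidences) / len(confidences)
    if mean_conf > hi:
        # Model is confident: unmask more aggressively.
        block_size = min(block_size * 2, 128)
        step_size = min(step_size + 1, 16)
        threshold = max(threshold - 0.05, 0.5)
    elif mean_conf < lo:
        # Model is uncertain: generate more conservatively.
        block_size = max(block_size // 2, 8)
        step_size = max(step_size - 1, 1)
        threshold = min(threshold + 0.05, 0.99)
    return block_size, step_size, threshold
```

A high-confidence step (mean above `hi`) widens the block and loosens the threshold, so more tokens are committed per step; a low-confidence step does the reverse. Because the rule is a pure function of observed confidences, it needs no training, matching the paper's training-free claim.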
Problem

Research questions and friction points this paper is trying to address.

Accelerates inference throughput of diffusion-based large language models
Reduces computational overhead via adaptive token sampling strategies
Maintains accuracy while improving generation efficiency without retraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free confidence-aware calibration for diffusion LLMs
Adaptive block and step size control via unmasking confidence
Dynamic vocabulary subset sampling to reduce softmax overhead
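The last bullet, restricting the softmax to a dynamic vocabulary subset, can be illustrated with a top-k variant: exponentials and normalization are computed over only the k highest logits instead of the full vocabulary. This is a generic sketch of the idea, not the paper's implementation, and the function name and dict-based return type are my own choices.

```python
import math

def topk_softmax(logits, k):
    """Softmax restricted to the k highest logits; all other tokens get
    probability zero. Cuts exp/normalization work from |V| terms to k."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in idx)                    # stabilize the exponentials
    exps = {i: math.exp(logits[i] - m) for i in idx}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}         # token index -> probability
```

For a 100k-token vocabulary with k in the hundreds, this removes the bulk of the per-step softmax cost; the paper's "dynamic" aspect would correspond to choosing k (the sampling breadth) per step rather than fixing it.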
Jucheng Shen
Rice University

Gaurav Sarkar
Intel Labs

Yeonju Ro
UT Austin
Systems for ML, AI Algorithm-System Co-design, ML for Systems

Sharath Nittur Sridhar
Research Scientist, Intel Corporation
Efficient AI, LLM, Multimodal Foundation Models, NLP, Computer Vision

Zhangyang Wang
The University of Texas at Austin