🤖 AI Summary
This work addresses the inefficiency in diffusion-based large language model (DLLM) inference, where substantial computational resources are wasted on non-decodable tokens. The study reveals, for the first time, a strong correlation between a token's importance and its decoding probability. Building on this insight, the authors propose a dynamic computation-focusing mechanism that leverages attention scores to evaluate token importance in real time, selectively retaining decodable tokens and reallocating computational resources accordingly. This approach significantly increases the effective batch size. When integrated with an optimized inference engine, the method achieves up to a 3.52× throughput improvement over the production-grade LMDeploy system while maintaining or even enhancing generation quality across multiple benchmark evaluations.
📝 Abstract
Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS -- an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy, while preserving or improving generation quality across multiple benchmarks. The FOCUS system is publicly available on GitHub: https://github.com/sands-lab/FOCUS.
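The core selection idea described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: it assumes per-token attention-derived importance scores are available for a block, ranks tokens by those scores, keeps only the top fraction as decode candidates, and marks the rest for eviction. The function name `focus_step` and the `keep_ratio` parameter are hypothetical, chosen for this sketch.

```python
import numpy as np

def focus_step(attn_scores, keep_ratio=0.25):
    """Illustrative token selection for one diffusion block.

    attn_scores: per-token importance scores derived from attention
                 (assumed, per the paper's observed correlation with
                 decoding probability).
    keep_ratio:  fraction of the block retained as decode candidates.

    Returns (keep, evict): positions retained vs. evicted. Evicted
    positions free compute, raising the effective batch size.
    """
    block_len = attn_scores.shape[0]
    k = max(1, int(block_len * keep_ratio))
    # Most important tokens by attention score are kept for decoding.
    keep = np.sort(np.argsort(attn_scores)[-k:])
    evict = np.setdiff1d(np.arange(block_len), keep)
    return keep, evict

# Toy block of 8 tokens with random importance scores.
rng = np.random.default_rng(0)
attn = rng.random(8)
keep, evict = focus_step(attn)
print("kept positions:", keep, "| evicted positions:", evict)
```

In a real system the evicted tokens would remain masked and be revisited in later diffusion steps once their decoding probability rises; the sketch only shows the per-step ranking-and-eviction decision.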