Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the prohibitive computational overhead of long-context diffusion-based large language models (dLLMs), which stems from full attention mechanisms during inference. Existing sparse attention methods struggle to accurately predict the importance of yet-to-be-decoded tokens. To overcome this, the authors propose a training-free attention sparsification framework that leverages the high correlation of token confidence between adjacent diffusion steps to identify and retain critical context regions and their associated attention sinks. A sink-aware pruning strategy is introduced, and sink positions are reused across layers to further enhance efficiency. The method achieves over 29× lossless speedup on 32K-length contexts, significantly improving the inference efficiency of long-context dLLMs without compromising output quality.
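The confidence-guided idea in this summary can be illustrated with a small sketch. It assumes that per-position confidence from the previous diffusion step is available, and it guesses that the most confident masked positions, plus a small neighbourhood around each, are the regions worth focusing on. The function name `predict_focus_positions`, the `num_to_unmask` budget, and the `window` parameter are hypothetical and are not taken from the paper or its code.

```python
def predict_focus_positions(prev_confidence, is_masked, num_to_unmask, window=1):
    """Guess which positions the next diffusion step is likely to unmask.

    prev_confidence: per-position confidence from the previous step.
    is_masked: per-position flags, True where the token is still masked.
    num_to_unmask: how many candidates to pick (assumed budget).
    window: neighbours on each side added around every candidate.
    """
    n = len(is_masked)
    masked = [i for i in range(n) if is_masked[i]]
    # Rank still-masked positions by previous-step confidence, highest first.
    ranked = sorted(masked, key=lambda i: prev_confidence[i], reverse=True)
    focus = set()
    for i in ranked[:num_to_unmask]:
        # Expand each candidate into a small contiguous region.
        focus.update(range(max(0, i - window), min(n, i + window + 1)))
    return sorted(focus)


if __name__ == "__main__":
    conf = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3]
    mask = [False, True, True, True, True, True]
    print(predict_focus_positions(conf, mask, num_to_unmask=1))
```

Only queries in the returned region would then need an accurate attention estimate, which is what makes the sparsification cheap. How the actual method sizes and shapes these regions is not stated here.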

📝 Abstract

Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits inference efficiency. Although sparse attention is promising, existing methods remain ineffective, because they must estimate attention importance for tokens that have yet to be decoded while the positions that will be unmasked are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past-confidence-guided indicator to predict unmasked regions. Building on this, we propose a sink-aware pruning strategy that accurately estimates and removes redundant attention computation while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: https://github.com/Longxmas/Focus-dLLM
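A rough sketch of sink-aware pruning with cross-layer sink reuse follows. It is an illustration under stated assumptions, not the released implementation: sinks are taken to be the keys with the largest mean attention mass in one reference layer, and each layer keeps those sinks plus a fixed budget of its highest-scoring remaining keys. The names `find_sinks`, `keep_set`, and `plan_layers`, and the `relevance` scores, are hypothetical.

```python
def find_sinks(attn_rows, num_sinks):
    """Return the key indices that receive the largest mean attention.

    attn_rows: attention weights, one row per query, from a reference layer.
    """
    n_keys = len(attn_rows[0])
    mean = [sum(row[j] for row in attn_rows) / len(attn_rows)
            for j in range(n_keys)]
    top = sorted(range(n_keys), key=lambda j: mean[j], reverse=True)
    return sorted(top[:num_sinks])


def keep_set(relevance, sinks, budget):
    """Keys to attend to: every sink plus the top `budget` non-sink keys.

    relevance: assumed per-key importance estimate for this layer.
    """
    sink_set = set(sinks)
    others = [j for j in range(len(relevance)) if j not in sink_set]
    top = sorted(others, key=lambda j: relevance[j], reverse=True)[:budget]
    return sorted(sink_set | set(top))


def plan_layers(reference_attn, per_layer_relevance, num_sinks, budget):
    """Identify sinks once, then reuse them when pruning every layer."""
    sinks = find_sinks(reference_attn, num_sinks)
    return [keep_set(rel, sinks, budget) for rel in per_layer_relevance]


if __name__ == "__main__":
    attn = [[0.6, 0.1, 0.2, 0.1],
            [0.5, 0.1, 0.1, 0.3]]
    relevance = [[0.0, 0.9, 0.1, 0.2],
                 [0.0, 0.1, 0.8, 0.3]]
    print(plan_layers(attn, relevance, num_sinks=1, budget=1))
```

Computing the sinks once and sharing them avoids repeating the identification step in every layer, which is the saving that cross-layer consistency is meant to enable.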

Problem

Research questions and friction points this paper is trying to address.

diffusion LLM

long-context inference

attention sparsification

computational efficiency

non-autoregressive decoding

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM

attention sparsification

confidence-guided prediction