Elastic-dLLM: Position Preserving Context Compression and Augmentation of Diffusion LLMs

📅 2026-05-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion-based large language models (dLLMs) incur high computational overhead during parallel decoding because each step reprocesses the preceding context along with many redundant [MASK] tokens. This work analyzes that redundancy from the perspective of [MASK] tokens and proposes a position-preserving [MASK] compression method that reduces computation while retaining the structural information the [MASK] tokens carry. It also introduces terminal-aware context augmentation for block dLLMs and a context-folding-like extension that supports longer contexts under limited input length for full-sequence dLLMs. In experiments on the LLaDA model series, the approach is reported to accelerate decoding and improve generation quality with negligible additional overhead.

📝 Abstract

Unlike autoregressive models, which generate one token at a time, dLLMs jointly denoise a chunk of [MASK] tokens and sample one or more tokens per step. Although this enables parallel decoding, it incurs substantial computational cost because of the large chunk of masked tokens. We observe that much of this cost is spent on repeatedly processing the preceding context and many [MASK] tokens that share the same feature representations, indicating considerable computational redundancy. In this work, we revisit redundancy in dLLMs from the perspective of [MASK] tokens. Through systematic analysis, we verify that [MASK] tokens are redundant while revealing their critical role in providing structural information. Guided by these findings, we propose position-preserving [MASK] token compression and terminal-aware augmentation. For full-sequence dLLMs such as LLaDA-8B-Instruct and LLaDA-1.5, compressing redundant [MASK] computation accelerates decoding and provides a natural extension toward context-folding-like long-context scaling under limited input-length constraints. For block dLLMs such as LLaDA2.0-mini, the approach augments the context with a protected terminal [MASK] token to enhance generation quality with negligible overhead.
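
The idea of position-preserving [MASK] compression can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `compress_masks`, the `keep_masks` budget, and the rule of keeping the first few [MASK] tokens plus the terminal one are all assumptions made for illustration. The only point it demonstrates is that dropped [MASK] tokens shorten the input while every surviving token keeps its original position index.

```python
MASK = "[MASK]"

def compress_masks(tokens, keep_masks=4, keep_terminal=True):
    """Drop redundant [MASK] tokens but keep original position ids.

    Hypothetical selection rule (an assumption, not the paper's):
    keep the first `keep_masks` [MASK] tokens and, optionally,
    the last [MASK] token in the sequence.
    Returns (kept_tokens, kept_positions).
    """
    mask_idx = [i for i, t in enumerate(tokens) if t == MASK]
    kept = set(mask_idx[:keep_masks])
    if keep_terminal and mask_idx:
        kept.add(mask_idx[-1])  # protect the terminal [MASK]

    out_tokens, out_positions = [], []
    for i, tok in enumerate(tokens):
        # Context tokens always survive; [MASK] tokens only if selected.
        if tok != MASK or i in kept:
            out_tokens.append(tok)
            out_positions.append(i)  # original index, not re-numbered
    return out_tokens, out_positions


# Two context tokens followed by a chunk of eight [MASK] tokens.
sequence = ["The", "cat"] + [MASK] * 8
tokens, positions = compress_masks(sequence, keep_masks=4)
print(len(sequence), "->", len(tokens))
print(positions)
```

In a real model the retained `positions` would be passed as position ids so that the surviving [MASK] tokens still signal where the chunk begins and ends; how the paper actually selects and protects tokens is not specified in this summary.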

Problem

Research questions and friction points this paper is trying to address.

diffusion LLMs

computational redundancy

context compression

parallel decoding

MASK tokens

Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion LLM

context compression

position-preserving