🤖 AI Summary
This work addresses the problem of fragmented cross-step information in discrete diffusion language models, termed the "Information Island" problem: because each denoising step conditions only on the current hard-masked sequence, intermediate continuous representations are discarded, leading to redundant computation and inconsistent generation across steps. To mitigate this, the authors propose MetaState, the first persistent working-memory mechanism tailored to discrete diffusion language models. MetaState integrates cross-step information through a fixed-size memory bank, independent of sequence length, using a three-module architecture: a cross-attention Mixer, a GRU-style Updater, and a cross-attention Injector. Trained with a K-step unrolled fine-tuning strategy while the backbone remains frozen, MetaState adds only a minimal number of trainable parameters yet consistently improves generation accuracy on LLaDA-8B and Dream-7B, demonstrating the efficacy of cross-step memory in enhancing generation quality.
📝 Abstract
Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the **Information Island** problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with **MetaState**, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. **MetaState** consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with *K*-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, **MetaState** introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
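To make the Mixer/Updater/Injector loop concrete, here is a minimal, hedged sketch of the idea in numpy. All sizes, weight shapes, and the exact gating equations are illustrative assumptions, not the paper's actual implementation: a fixed-size memory bank reads from backbone activations via cross-attention (Mixer), is updated with a simplified GRU-style gate (Updater), and is read back into the activations via a second cross-attention with a residual add (Injector).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 4  # hidden size and number of memory slots (illustrative, not from the paper)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, Wq, Wk, Wv):
    # Single-head cross-attention: rows of `q` attend over rows of `kv`.
    Q, K, V = q @ Wq, kv @ Wk, kv @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

class MetaState:
    """Hypothetical sketch of a persistent, fixed-size working memory."""
    def __init__(self):
        self.memory = np.zeros((m, d))          # memory bank: size independent of sequence length
        init = lambda: rng.normal(0.0, 0.02, (d, d))
        self.Wq_mix, self.Wk_mix, self.Wv_mix = init(), init(), init()   # Mixer weights
        self.Wz, self.Uz = init(), init()       # GRU-style update gate (simplified: no reset gate)
        self.Wh, self.Uh = init(), init()       # candidate memory weights
        self.Wq_inj, self.Wk_inj, self.Wv_inj = init(), init(), init()   # Injector weights

    def step(self, h):
        # h: frozen-backbone activations at the current denoising step, shape (seq_len, d).
        # 1) Mixer: memory slots read from backbone activations.
        read = cross_attention(self.memory, h, self.Wq_mix, self.Wk_mix, self.Wv_mix)
        # 2) Updater: gated integration of the new read with the persistent memory.
        z = 1.0 / (1.0 + np.exp(-(read @ self.Wz + self.memory @ self.Uz)))
        cand = np.tanh(read @ self.Wh + self.memory @ self.Uh)
        self.memory = (1.0 - z) * self.memory + z * cand
        # 3) Injector: tokens read from the updated memory; added back residually.
        inject = cross_attention(h, self.memory, self.Wq_inj, self.Wk_inj, self.Wv_inj)
        return h + inject

# K-step unrolled use: the same memory persists across K denoising steps, so during
# fine-tuning gradients would flow through this unrolled chain (backbone frozen).
ms = MetaState()
for _ in range(4):                       # K = 4 denoising steps, illustrative
    h = rng.normal(size=(8, d))          # stand-in activations for an 8-token sequence
    out = ms.step(h)
print(out.shape, ms.memory.shape)        # (8, 16) (4, 16)
```

Note the design point the abstract emphasizes: the memory has shape `(m, d)` regardless of `seq_len`, so the cross-step state carried between denoising steps stays constant-size, and only the three small modules would be trained.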