MetaState: Persistent Working Memory for Discrete Diffusion Language Models

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the issue of fragmented cross-step information—termed “information islands”—in discrete diffusion language models, which arises because denoising relies solely on the current hard-masked sequence and discards intermediate continuous representations, leading to redundant computation and inconsistent generation. To mitigate this, the authors propose MetaState, the first persistent working memory mechanism tailored for discrete diffusion language models. MetaState integrates cross-step information via a fixed-size memory bank, independent of sequence length, through a three-module architecture comprising a cross-attention Mixer, a GRU-style Updater, and a cross-attention Injector. Trained with a K-step unrolled fine-tuning strategy while keeping the backbone model frozen, MetaState introduces only a minimal number of trainable parameters yet significantly improves generation accuracy on LLaDA-8B and Dream-7B, demonstrating the efficacy of cross-step memory in enhancing generation quality.

📝 Abstract
Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
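The abstract's memory loop (Mixer reads activations into memory, Updater gates them in, Injector writes back) can be sketched as follows. This is a minimal illustration based only on the abstract: the shapes, single-head attention, random stand-in weights, and the residual injection are all assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    # Single-head cross-attention: queries attend over keys_values.
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 16          # hidden size (assumed)
num_slots = 4   # fixed-size memory bank, independent of sequence length
seq_len = 32    # tokens in the current masked sequence

# Trainable projections (random stand-ins here).
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
Wz = rng.standard_normal((2 * d, d)) * 0.1   # GRU-style update-gate weights
Wh = rng.standard_normal((2 * d, d)) * 0.1   # candidate-memory weights

memory = np.zeros((num_slots, d))            # persistent working memory

for step in range(3):                        # a few denoising steps
    # Stand-in for the frozen backbone's activations at this step.
    hidden = rng.standard_normal((seq_len, d))

    # 1) Mixer: memory slots read from backbone activations.
    read = cross_attention(memory, hidden, Wq, Wk, Wv)

    # 2) Updater: GRU-style gated integration across steps.
    zin = np.concatenate([memory, read], axis=-1)
    z = sigmoid(zin @ Wz)                    # update gate
    cand = np.tanh(zin @ Wh)                 # candidate memory
    memory = (1 - z) * memory + z * cand

    # 3) Injector: activations read the updated memory back (residual, assumed).
    hidden = hidden + cross_attention(hidden, memory, Wq, Wk, Wv)

print(memory.shape)   # stays (num_slots, d) regardless of seq_len
```

Note how the memory bank's size never depends on `seq_len`, which is the property that keeps the mechanism lightweight; in the paper these modules would be trained with K-step unrolling while the backbone stays frozen.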
Problem

Research questions and friction points this paper is trying to address.

discrete diffusion language models
Information Island
persistent memory
cross-step consistency
denoising steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion language models
persistent memory
working memory
Information Island
cross-step consistency
🔎 Similar Papers
No similar papers found.
Kejing Xia
Georgia Institute of Technology
Mingzhe Li
Ph.D. Student @ UMass Amherst
AI Security
Lixuan Wei
Harvard University
Zhenbang Du
Georgia Institute of Technology
Xiangchi Yuan
Georgia Institute of Technology
Representation Learning
Qirui Jin
Georgia Institute of Technology
Wenke Lee
Georgia Institute of Technology