🤖 AI Summary
This work addresses the underexplored memorization risks of diffusion language models (DLMs). Prior evaluations have largely been confined to prefix-conditioned extraction, which overlooks the broader leakage potential of bidirectional denoising. The authors propose a “fill-in-the-blank” protocol, infilling extraction, that is parameterized by an arbitrary binary mask and therefore includes prefix-only probing as a special case. Experiments on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora show that mask geometry governs extractability: edge-conditioned masks recover up to three times as many verbatim sequences as prefix-conditioned ones. An adversary who holds training data with personally identifiable information redacted can recover the redacted email addresses with higher recall from DLMs than from scale-matched autoregressive models. Decoding parameters measurably affect extraction, and a follow-up supervised fine-tuning stage does not eliminate prior memorization. Together, these findings indicate that bidirectional context opens extraction channels that prefix-only evaluation misses, which poses a serious challenge for privacy preservation.
📝 Abstract
Memorization in large language models has been studied almost exclusively through prefix-conditioned extraction, a natural choice for autoregressive models. However, diffusion language models (DLMs) can denoise masked tokens at arbitrary positions. Thus, prefix-only probing reveals only one facet of memorization in DLMs and significantly underestimates the risk of training-data extraction. To realistically model the extractability of training data in DLMs, we introduce \emph{infilling extraction}, a data-extraction protocol parameterized by an arbitrary binary mask that subsumes prefix-only probing and accounts for the bidirectional inductive bias of DLMs. Instantiating it on LLaDA-8B and Dream-7B across five extraction modes, three training pipelines, and three corpora covering verbatim and partial leakage, we find that mask geometry governs extractability: edge-conditioned masks \emph{extract up to three times more} verbatim sequences than prefix-conditioned ones, and bidirectional access opens channels inaccessible in autoregressive models. In particular, we show that a realistic adversary with access to training data in which personally identifiable information has been redacted can achieve higher recall when extracting redacted email addresses from DLMs than from scale-matched autoregressive models. Tunable decoding parameters measurably affect extraction performance, while a follow-up supervised fine-tuning stage does not eliminate the prior memorization.
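To make the role of the binary mask concrete, the following is a minimal sketch of how a prefix-conditioned probe and an edge-conditioned probe could be written as two instances of the same mask-parameterized check. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `<mask>` placeholder, the choice of which positions count as "edges", and the toy lookup model standing in for a DLM are all hypothetical.

```python
from typing import Callable, List, Sequence

MASK = "<mask>"  # hypothetical placeholder; real DLMs use their own mask token


def prefix_mask(n: int, k: int) -> List[bool]:
    """Binary mask over n positions; True = hidden. Reveals only the first k tokens."""
    return [i >= k for i in range(n)]


def edge_mask(n: int, k_left: int, k_right: int) -> List[bool]:
    """Reveals k_left tokens at the start and k_right at the end (assumed edge layout)."""
    return [k_left <= i < n - k_right for i in range(n)]


def apply_mask(tokens: Sequence[str], mask: Sequence[bool]) -> List[str]:
    """Replace every hidden position with the mask placeholder."""
    return [MASK if hidden else tok for tok, hidden in zip(tokens, mask)]


def is_verbatim(
    tokens: Sequence[str],
    mask: Sequence[bool],
    infill: Callable[[List[str]], List[str]],
) -> bool:
    """True if the model reproduces every hidden token of the training sequence exactly."""
    prediction = infill(apply_mask(tokens, mask))
    if len(prediction) != len(tokens):
        return False
    return all(p == t for p, t, hidden in zip(prediction, tokens, mask) if hidden)


# Toy stand-in for a DLM: it has "memorized" one sequence and fills the blanks
# only when every visible token agrees with that sequence.
MEMORIZED = ["a", "b", "c", "d", "e", "f"]


def toy_infill(masked: List[str]) -> List[str]:
    consistent = len(masked) == len(MEMORIZED) and all(
        tok == MASK or tok == ref for tok, ref in zip(masked, MEMORIZED)
    )
    if not consistent:
        return masked
    return [ref if tok == MASK else tok for tok, ref in zip(masked, MEMORIZED)]
```

Because both probes differ only in the mask passed to `is_verbatim`, other extraction modes can be added by writing another mask constructor, which is the sense in which a mask-parameterized protocol subsumes prefix-only probing.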