🤖 AI Summary
Traditional masked diffusion language models make binary masking decisions during decoding, so the predictive information at positions that remain masked is discarded at each step. To address this, we propose Soft-Mask Diffusion LM, which replaces the hard retain-or-replace operation with a learnable, continuous weighting mechanism that fuses the top-$k$ predicted token embeddings with the mask embedding at still-masked positions, preserving and propagating predictive information across parallel iterative denoising steps. We design dedicated continued pretraining and fine-tuning strategies to adapt existing models. Experiments on a 169M-parameter model show reduced perplexity and improved MAUVE scores; on Dream-7B and Dream-Coder-7B, the method improves performance across diverse coding tasks, particularly in high-throughput settings. The core contribution is generalizing the discrete masking decision into a continuous embedding fusion, balancing generation quality and inference efficiency.
📝 Abstract
Diffusion models have demonstrated strong potential in language modeling, offering various advantages over traditional autoregressive approaches. Their ability to generate and revise entire responses in parallel enables faster generation and built-in self-correction mechanisms. Most modern diffusion-based language models employ masked diffusion, where decoding involves iteratively processing masked tokens based on a binary decision: either retaining the mask or replacing it with the predicted token. However, this binary choice discards valuable predictive information when the mask is retained. To address this limitation, we introduce soft-masking (SM), a novel method that dynamically blends the embedding of the mask token with the embeddings of the top-$k$ predicted tokens from the previous decoding step, for each retained mask. This provides the model with a more informative prior, preserving context from earlier computations and allowing partial information about masked tokens to propagate beyond a single step. We propose a training methodology that adapts a pretrained masked diffusion language model to incorporate SM. We demonstrate that continued pretraining of a 169M-parameter model with SM leads to improved perplexity and MAUVE scores. Furthermore, we fine-tune two state-of-the-art diffusion models, Dream-7B and Dream-Coder-7B, with SM. SM consistently improves performance across multiple coding benchmarks, particularly in high-throughput settings.
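
To make the blending step concrete, here is a minimal PyTorch sketch of the kind of embedding fusion the abstract describes. The function name, the fixed scalar blending weight `alpha`, and the probability-renormalized top-$k$ average are illustrative assumptions, not the paper's exact formulation, in which the weighting is learnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_mask_embeddings(prev_logits: torch.Tensor,
                         embedding: nn.Embedding,
                         mask_token_id: int,
                         k: int = 8,
                         alpha: float = 0.5) -> torch.Tensor:
    """Blend the mask-token embedding with top-k predicted token embeddings.

    prev_logits: (num_masked, vocab_size) logits from the previous denoising step
    Returns:     (num_masked, hidden_dim) input embeddings for the next step
    Note: `alpha` is a fixed stand-in for the learnable weighting in the paper.
    """
    # Top-k predicted tokens and their probabilities, renormalized over the k tokens.
    probs = F.softmax(prev_logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k, dim=-1)                # (num_masked, k)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

    # Probability-weighted average of the top-k token embeddings.
    topk_emb = embedding(topk_ids)                              # (num_masked, k, d)
    pred_emb = (topk_probs.unsqueeze(-1) * topk_emb).sum(dim=1)

    # Soft mask: interpolate toward the mask embedding instead of keeping a hard mask.
    mask_emb = embedding.weight[mask_token_id]                  # (d,) broadcasts over rows
    return alpha * pred_emb + (1.0 - alpha) * mask_emb
```

In this sketch, the returned vectors would replace the plain mask embedding at every position whose mask is retained, so partial information from the previous step carries into the next forward pass rather than being discarded.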