Balancing Understanding and Generation in Discrete Diffusion Models

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete diffusion language models have long struggled to balance semantic understanding and generation quality: Masked Diffusion Language Models (MDLMs) excel in understanding but lag in generation, while Uniform-noise Diffusion Language Models (UDLMs) generate efficiently yet exhibit limited comprehension. This work proposes XDLM, a novel framework that unifies the masked and uniform-noise diffusion paradigms through a stationary noise kernel, achieving theoretical consistency while alleviating memory bottlenecks via algebraic simplification of posterior probabilities. XDLM simultaneously advances both objectives, outperforming UDLM by 5.4 points in zero-shot text understanding and achieving a 54.1 FID in few-step image generation—substantially better than MDLM’s 80.8. Moreover, an 8B-parameter XDLM attains a 15.0 MBPP score within 32 steps, doubling prior performance and significantly advancing the Pareto frontier of discrete diffusion models.

📝 Abstract
In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) it alleviates the memory bottleneck through an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When used to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long-term scaling. Code is available at https://github.com/MzeroMiko/XDLM
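To make the unification concrete, here is a minimal, hypothetical sketch of a forward corruption step whose stationary distribution interpolates between the two paradigms. This is an illustration of the general idea, not the paper's actual kernel: a mixing weight `lam` (an assumed parameter, not from the paper) places probability mass on a `[MASK]` token versus a uniform distribution over the vocabulary, so `lam = 1` recovers masked (absorbing-state) diffusion and `lam = 0` recovers uniform-noise diffusion.

```python
import numpy as np

def forward_noise_step(tokens, alpha, lam, vocab_size, mask_id, rng):
    """One illustrative forward corruption step for discrete diffusion.

    Each token survives with probability `alpha`; otherwise it is resampled
    from a stationary mixture: mass `lam` on the mask token, mass 1 - lam
    spread uniformly over the vocabulary.  This is a sketch of how a single
    kernel can subsume both masked (lam=1) and uniform-noise (lam=0)
    diffusion; XDLM's exact kernel and schedule are defined in the paper.
    """
    tokens = np.asarray(tokens)
    # P(corrupt) = 1 - alpha: positions where the token is resampled.
    corrupt = rng.random(tokens.shape) >= alpha
    # Among corrupted positions: mask with prob lam, else uniform draw.
    use_mask = rng.random(tokens.shape) < lam
    uniform = rng.integers(0, vocab_size, tokens.shape)
    return np.where(corrupt, np.where(use_mask, mask_id, uniform), tokens)

rng = np.random.default_rng(0)
# lam = 1, alpha = 0: fully corrupted into the absorbing mask state (MDLM limit).
print(forward_noise_step([1, 2, 3], alpha=0.0, lam=1.0,
                         vocab_size=10, mask_id=10, rng=rng))
```

Recovering each paradigm as an endpoint of one parameterized kernel mirrors the paper's claim that MDLM and UDLM are special cases of a single stationary noise kernel.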
Problem

Research questions and friction points this paper is trying to address.

discrete diffusion models
semantic understanding
generation quality
zero-shot generalization
few-step generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion models
stationary noise kernel
theoretical unification
memory bottleneck alleviation
Pareto frontier optimization