LLaDA-MoE: A Sparse MoE Diffusion Language Model

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the high computational cost of diffusion language models (DLMs), this work proposes LLaDA-MoE, the first successful integration of a sparse Mixture-of-Experts (MoE) architecture into a masked diffusion language model. Trained from scratch on roughly 20 trillion tokens, the model has 7 billion total parameters but activates only 1.4 billion during inference, substantially reducing computational overhead. Its core contribution is the co-design of a sparse MoE architecture with the masked diffusion training objective, instruction fine-tuning, and pretraining strategies adapted from large-scale autoregressive models. Across multiple benchmarks, LLaDA-MoE outperforms existing DLMs, and its instruction-tuned variant performs on par with Qwen2.5-3B-Instruct in knowledge understanding, code generation, and mathematical reasoning. The work establishes a new paradigm for efficient diffusion-based language modeling.
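For readers unfamiliar with masked diffusion training, the loss commonly used by LLaDA-style models can be written as below. This is a standard formulation from prior LLaDA work and an assumption on my part; the summary above does not state the objective explicitly, and the notation is mine.

```latex
% Masked diffusion training objective (standard LLaDA-style form; notation assumed, not from this page).
% x_0: clean sequence of length L; t ~ U(0,1]; x_t masks each token of x_0 independently with prob. t.
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
  \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = \mathrm{[MASK]}\right]
  \log p_\theta\!\left(x_0^i \mid x_t\right)
\right]
```

Only masked positions contribute to the loss, and the 1/t weighting makes the expression an upper bound on the model's negative log-likelihood, which is what allows a masked predictor to be trained as a generative diffusion model.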

πŸ“ Abstract
We introduce LLaDA-MoE, a large language diffusion model with a Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models, surpassing the larger previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths, enabling efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.
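Since the abstract notes that the checkpoints are released on Hugging Face, a minimal loading sketch follows. The repository id and the model class used here are assumptions for illustration; the exact names and the diffusion-style sampling code should be taken from the actual model card.

```python
# Minimal sketch for loading the released checkpoint from Hugging Face.
# NOTE: the repo id below is an assumption; check the model card for the exact name.
from transformers import AutoModel, AutoTokenizer

repo_id = "inclusionAI/LLaDA-MoE-7B-A1B-Instruct"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)  # custom diffusion code ships with the repo

# Masked-diffusion sampling is provided by the repository's own remote code and examples,
# not by the standard autoregressive generate() API.
```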
Problem

Research questions and friction points this paper is trying to address.

Developing a sparse MoE diffusion language model with reduced computational overhead
Achieving state-of-the-art performance among DLMs while activating fewer parameters
Integrating an MoE architecture into masked diffusion language model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

A sparse Mixture-of-Experts architecture for a diffusion language model
Maintains 7B total parameters but activates only 1.4B during inference (see the sketch after this list)
Integrates the MoE architecture into the masked diffusion language model training objective
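To make the sparse-activation idea concrete, here is a minimal top-k routed MoE feed-forward layer in PyTorch. The expert count, hidden sizes, and k below are illustrative assumptions, not the paper's actual configuration; the point is that only k of the experts run per token, so the active parameter count is roughly k/num_experts of the layer's total FFN parameters, which is the mechanism behind the 7B-total / 1.4B-active figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Minimal top-k routed Mixture-of-Experts feed-forward layer (illustrative sizes)."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, num_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # normalize weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = top_idx[..., slot]            # (batch, seq) expert index for this slot
            w = top_w[..., slot].unsqueeze(-1)  # (batch, seq, 1) routing weight
            for e, expert in enumerate(self.experts):
                mask = (idx == e)               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(x[mask])
        return out

# Quick smoke test
layer = TopKMoEFFN()
y = layer(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

With k=2 of 8 experts, each token exercises about a quarter of the FFN parameters per layer; scaled up, the same routing principle yields a model whose total capacity far exceeds its per-token compute.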
πŸ‘₯ Authors
Fengqi Zhu – Renmin University of China, Ant Group
Zebin You – Renmin University of China (generative model, diffusion model, semi-supervised learning, self-supervised learning)
Yipeng Xing – Ant Group
Zenan Huang – Ant Research (Machine Learning, Causality, LLMs)
Lin Liu – Ant Group
Yihong Zhuang – Ant Group
Guoshan Lu – Zhejiang University (Machine Learning, LLM)
Kangyu Wang – Ant Group, Shanghai Jiao Tong University
Xudong Wang – Ant Group
Lanning Wei – Ant Group
Hongrui Guo – Ant Group
Jiaqi Hu – Rice University; Genentech (Artificial Intelligence, Deep Learning)
Wentao Ye – Zhejiang University, Ant Research (LLMs, Machine Learning, Multimodality)
Tieyuan Chen – Shanghai Jiao Tong University (Computer Vision, Video Understanding, Causal Discovery, Causal Reasoning)
Chenchen Li – Ant Group
Chengfu Tang – Ant Group
Haibo Feng – Assistant Professor, University of British Columbia (LCA, Sustainable Construction, Building Materials, Mass Timber, Digital technologies)
Jun Hu – Ant Group
Jun Zhou – Ant Group
Xiaolu Zhang – Ant Group
Zhenzhong Lan – School of Engineering, Westlake University (NLP, Computer Vision, Multimedia)
Junbo Zhao – Ant Group, Zhejiang University
Da Zheng – Amazon (High-performance computing, Data-intensive computing, Large-scale machine learning, Graph neural networks)
Chongxuan Li – Associate Professor, Renmin University of China (Machine Learning, Generative Models, Deep Learning)
Jianguo Li – Director, Ant Group (deep learning, computer vision, machine learning, systems)