LLaDA-MoE: A Sparse MoE Diffusion Language Model

πŸ“… 2025-09-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address the high computational cost of diffusion language models (DLMs), this work proposes LLaDA-MoE, the first successful integration of a sparse Mixture-of-Experts (MoE) architecture into a masked diffusion language model. Trained from scratch on roughly 20 trillion tokens, the model has 7 billion total parameters but activates only 1.4 billion during inference, substantially reducing computational overhead. Its core contribution is the co-design of a sparse MoE architecture with the masked diffusion training objective, instruction fine-tuning, and pretraining strategies adapted from large-scale autoregressive models. Across multiple benchmarks, LLaDA-MoE outperforms existing DLMs, and its instruction-tuned variant performs on par with Qwen2.5-3B-Instruct in knowledge understanding, code generation, and mathematical reasoning. The work establishes a new paradigm for efficient diffusion-based language modeling.
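For readers unfamiliar with masked diffusion training, the loss commonly used by LLaDA-style models can be written as below. This is a standard formulation from prior LLaDA work and an assumption on my part; the summary above does not state the objective explicitly, and the notation is mine.

```latex
% Masked diffusion training objective (standard LLaDA-style form; notation assumed, not from this page).
% x_0: clean sequence of length L; t ~ U(0,1]; x_t masks each token of x_0 independently with prob. t.
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{t,\,x_0,\,x_t}\!\left[
  \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^i = \mathrm{[MASK]}\right]
  \log p_\theta\!\left(x_0^i \mid x_t\right)
\right]
```

Only masked positions contribute to the loss, and the 1/t weighting makes the expression an upper bound on the model's negative log-likelihood, which is what allows a masked predictor to be trained as a generative diffusion model.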

πŸ“ Abstract
We introduce LLaDA-MoE, a large language diffusion model with a Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models, surpassing the larger previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths, enabling efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.
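Since the abstract notes that the checkpoints are released on Hugging Face, a minimal loading sketch follows. The repository id and the model class used here are assumptions for illustration; the exact names and the diffusion-style sampling code should be taken from the actual model card.

```python
# Minimal sketch for loading the released checkpoint from Hugging Face.
# NOTE: the repo id below is an assumption; check the model card for the exact name.
from transformers import AutoModel, AutoTokenizer

repo_id = "inclusionAI/LLaDA-MoE-7B-A1B-Instruct"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)  # custom diffusion code ships with the repo

# Masked-diffusion sampling is provided by the repository's own remote code and examples,
# not by the standard autoregressive generate() API.
```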
Problem

Research questions and friction points this paper is trying to address.

Developing a sparse MoE diffusion language model with reduced computational overhead
Achieving state-of-the-art performance among DLMs while activating fewer parameters
Integrating an MoE architecture into masked diffusion language model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

A sparse Mixture-of-Experts architecture for a diffusion language model
Maintains 7B total parameters but activates only 1.4B during inference (see the sketch after this list)
Integrates the MoE architecture into the masked diffusion language model training objective
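To make the sparse-activation idea concrete, here is a minimal top-k routed MoE feed-forward layer in PyTorch. The expert count, hidden sizes, and k below are illustrative assumptions, not the paper's actual configuration; the point is that only k of the experts run per token, so the active parameter count is roughly k/num_experts of the layer's total FFN parameters, which is the mechanism behind the 7B-total / 1.4B-active figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoEFFN(nn.Module):
    """Minimal top-k routed Mixture-of-Experts feed-forward layer (illustrative sizes)."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        scores = self.router(x)                 # (batch, seq, num_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)        # normalize weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = top_idx[..., slot]            # (batch, seq) expert index for this slot
            w = top_w[..., slot].unsqueeze(-1)  # (batch, seq, 1) routing weight
            for e, expert in enumerate(self.experts):
                mask = (idx == e)               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(x[mask])
        return out

# Quick smoke test
layer = TopKMoEFFN()
y = layer(torch.randn(2, 16, 512))
print(y.shape)  # torch.Size([2, 16, 512])
```

With k=2 of 8 experts, each token exercises about a quarter of the FFN parameters per layer; scaled up, the same routing principle yields a model whose total capacity far exceeds its per-token compute.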
πŸ‘₯ Authors
Fengqi Zhu – Renmin University of China, Ant Group
Zebin You – Renmin University of China (generative model, diffusion model, semi-supervised learning, self-supervised learning)
Yipeng Xing – Ant Group
Zenan Huang – Ant Research (Machine Learning, Causality, LLMs)
Lin Liu – Ant Group
Yihong Zhuang – Ant Group
Guoshan Lu – Zhejiang University (Machine Learning, LLM)
Kangyu Wang – Ant Group, Shanghai Jiao Tong University
Xudong Wang – Ant Group
Lanning Wei – Ant Group
Hongrui Guo – Ant Group
Jiaqi Hu – Rice University; Genentech (Artificial Intelligence, Deep Learning)
Wentao Ye – Zhejiang University, Ant Research (LLMs, Machine Learning, Multimodality)
Tieyuan Chen – Shanghai Jiao Tong University (Computer Vision, Video Understanding, Causal Discovery, Causal Reasoning)
Chenchen Li – Ant Group
Chengfu Tang – Ant Group
Haibo Feng – Assistant Professor, University of British Columbia (LCA, Sustainable Construction, Building Materials, Mass Timber, Digital technologies)
Jun Hu – Ant Group
Jun Zhou – Ant Group
Xiaolu Zhang – Ant Group
Zhenzhong Lan – School of Engineering, Westlake University (NLP, Computer Vision, Multimedia)
Junbo Zhao – Ant Group, Zhejiang University
Da Zheng – Amazon (High-performance computing, Data-intensive computing, Large-scale machine learning, Graph neural networks)
Chongxuan Li – Associate Professor, Renmin University of China (Machine Learning, Generative Models, Deep Learning)
Jianguo Li – Director, Ant Group (deep learning, computer vision, machine learning, systems)