LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

📅 2026-04-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing models struggle to balance multimodal understanding and generation efficiently within a single framework. This work proposes a natively integrated discrete diffusion large language model that supports interleaved multimodal reasoning and generation by combining SigLIP-VQ visual discretization, a Mixture-of-Experts (MoE) backbone, and a diffusion decoder, with block-wise masked diffusion applied to both text and visual tokens in the backbone. Beyond parallel decoding, inference is accelerated by prefix-aware optimizations in the backbone and few-step distillation in the decoder. The resulting model matches specialized vision-language models on multimodal understanding tasks while achieving strong performance in image generation and editing.
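As a rough illustration of what "visual discretization" means, the sketch below maps continuous patch features to discrete token ids by nearest-codebook lookup, the generic vector-quantization step. It is not the SigLIP-VQ tokenizer itself: the function name `quantize`, the 2-D features, and the 3-entry codebook are all hypothetical and chosen only to keep the example small.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each feature vector, the index of the nearest codebook entry."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        # Squared Euclidean distance between two vectors of equal length.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


# Hypothetical toy codebook (assumption: real codebooks are far larger
# and live in a high-dimensional feature space).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patch_features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.6, 0.1]]

tokens = quantize(patch_features, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Once images are expressed as token ids like these, a single backbone can treat text and visual tokens uniformly.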

📝 Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
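To make "block-level masked diffusion" with parallel decoding more concrete, here is a minimal toy sketch of one common decoding scheme: the sequence is filled block by block, and within each block the most confident masked positions are committed at every step. This is an assumption-laden illustration, not the paper's algorithm; `blockwise_decode`, `toy_predict`, the `MASK` sentinel, and the unmasking schedule are all hypothetical. A real model would predict every masked position in one backbone forward pass per step, whereas the toy predictor below is called per position.

```python
from typing import Callable, List, Sequence, Tuple

MASK = -1  # hypothetical sentinel id for a masked position


def blockwise_decode(prefix: Sequence[int], total_len: int, block_size: int,
                     steps_per_block: int,
                     predict: Callable[[List[int], int], Tuple[int, float]]) -> List[int]:
    """Fill masked positions block by block, most confident positions first."""
    seq = list(prefix) + [MASK] * (total_len - len(prefix))
    start = len(prefix)
    while start < total_len:
        end = min(start + block_size, total_len)
        for step in range(steps_per_block):
            masked = [i for i in range(start, end) if seq[i] == MASK]
            if not masked:
                break
            # (token, confidence) for every masked position in the block.
            preds = {i: predict(seq, i) for i in masked}
            # Unmask ceil(masked / remaining steps) positions, so the
            # final step always clears whatever is left in the block.
            remaining = steps_per_block - step
            k = -(-len(masked) // remaining)
            chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
            for i in chosen:
                seq[i] = preds[i][0]
        start = end  # earlier blocks are now a fixed prefix for later ones
    return seq


def toy_predict(seq: List[int], i: int) -> Tuple[int, float]:
    # Stand-in for a model: deterministic token and confidence per position.
    return (i * 7) % 10, 1.0 / (1 + i)


out = blockwise_decode(prefix=[5, 5], total_len=8, block_size=3,
                       steps_per_block=2, predict=toy_predict)
print(out)  # [5, 5, 4, 1, 8, 5, 2, 9]
```

Because completed blocks never change, they behave like a fixed prefix for later blocks, which is the kind of structure that prefix-oriented caching can exploit.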
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
multimodal generation
unified foundation model
diffusion language model
vision-language integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion LLM
multimodal unification
masked block diffusion
MoE-based backbone
few-step distillation
Tiwei Bie
AGI Research Center, Inclusion AI
Haoxing Chen
AGI Research Center, Inclusion AI
Tieyuan Chen
Shanghai Jiao Tong University
Computer Vision, Video Understanding, Causal Discovery, Causal Reasoning
Zhenglin Cheng
Zhejiang University & Westlake University, SII
Multimodal Learning, Diffusion Models
Long Cui
AGI Research Center, Inclusion AI
Kai Gan
AGI Research Center, Inclusion AI
Zhicheng Huang
AGI Research Center, Inclusion AI
Zhenzhong Lan
School of Engineering, Westlake University
NLP, Computer Vision, Multimedia
Haoquan Li
AGI Research Center, Inclusion AI
Jianguo Li
Director, Ant Group
deep learning, computer vision, machine learning, system
Tao Lin
AGI Research Center, Inclusion AI
Qi Qin
AGI Research Center, Inclusion AI
Hongjun Wang
AGI Research Center, Inclusion AI
Xiaomei Wang
AGI Research Center, Inclusion AI
Haoyuan Wu
The Chinese University of Hong Kong
Generative AI, Large Language Models, Multimodal Models, Agentic AI, Representation Learning
Yi Xin
California Institute of Technology
Industrial Organization, Econometrics
Junbo Zhao
AGI Research Center, Inclusion AI
Inclusion AI
AGI Research Center, Inclusion AI