LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing models struggle to efficiently balance multimodal understanding and generation within a unified framework. This work proposes a natively integrated discrete diffusion large language model that, for the first time, enables interleaved multimodal reasoning and generation through SigLIP-VQ visual discretization, a Mixture-of-Experts (MoE) backbone architecture, and a block-wise masked diffusion decoder. The approach introduces prefix-aware optimization along with parallel and few-step distillation inference strategies, substantially enhancing computational efficiency and scalability. The resulting model matches specialized vision-language models in multimodal understanding tasks while achieving strong performance in high-fidelity image generation and editing, thereby supporting efficient and unified multimodal content creation.
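To make the visual discretization step concrete, here is a minimal sketch of generic vector quantization: each continuous feature vector is replaced by the index of its nearest codebook entry. This is not the SigLIP-VQ tokenizer itself; the function name `quantize`, the squared-L2 distance, and the tiny 2-D codebook are assumptions chosen purely for illustration.

```python
def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Hypothetical helper for illustration; uses squared L2 distance.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


# Toy codebook with three 2-D entries (assumed values, not from the paper).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Toy "patch features" standing in for the output of a vision encoder.
features = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.05]]

tokens = quantize(features, codebook)
print(tokens)  # [1, 2, 0]
```

The resulting integer ids can be treated like text tokens, which is what lets one masked diffusion backbone operate over both modalities.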

📝 Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
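The block-level masked diffusion described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the released implementation: `toy_denoiser` is a random stand-in for the dLLM backbone, `MASK` is a hypothetical mask id, and the confidence-based unmasking schedule is one common choice for masked diffusion decoding rather than the one confirmed by the paper. Prefix-aware optimizations (for example, reusing computation for already-finished blocks) are not modeled here.

```python
import random

MASK = -1  # hypothetical mask token id, used only in this sketch


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the backbone: a (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def blockwise_masked_decode(prefix, num_blocks, block_size,
                            steps_per_block, vocab_size, seed=0):
    """Generate blocks left to right; within a block, unmask tokens in
    parallel over a few steps, most confident predictions first."""
    rng = random.Random(seed)
    seq = list(prefix)
    for _ in range(num_blocks):
        start = len(seq)
        seq.extend([MASK] * block_size)  # append a fully masked block
        for step in range(steps_per_block):
            masked = [i for i in range(start, len(seq)) if seq[i] == MASK]
            if not masked:
                break
            preds = toy_denoiser(seq, vocab_size, rng)
            # Spread the remaining masked positions over the remaining
            # steps, so the final step always clears the block.
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)  # ceiling division
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            for i in masked[:k]:
                seq[i] = preds[i][0]
    return seq


out = blockwise_masked_decode(prefix=[5, 6], num_blocks=3, block_size=4,
                              steps_per_block=2, vocab_size=10)
print(len(out))
```

With `steps_per_block` smaller than `block_size`, several tokens are committed per step, which is the source of the parallel-decoding speedup compared with one-token-at-a-time autoregressive generation.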

Problem

Research questions and friction points this paper is trying to address.

multimodal understanding

multimodal generation

unified foundation model

diffusion language model

vision-language integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete diffusion LLM

multimodal unification

masked block diffusion