🤖 AI Summary
Existing models struggle to balance multimodal understanding and generation efficiently within a single framework. This work proposes LLaDA2.0-Uni, a natively integrated discrete diffusion large language model that supports interleaved multimodal reasoning and generation. It combines SigLIP-VQ visual discretization, a Mixture-of-Experts (MoE) backbone that applies block-wise masked diffusion to both text and visual tokens, and a diffusion decoder that reconstructs visual tokens into images. Inference is accelerated by parallel decoding, prefix-aware optimizations in the backbone, and few-step distillation in the decoder. The resulting model matches specialized vision-language models on multimodal understanding tasks while achieving strong performance in high-fidelity image generation and editing.
📝 Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
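To make the idea of block-level masked diffusion with parallel decoding concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a common masked-diffusion decoding scheme in which each new block starts fully masked and, at every step, the most confident predictions are committed in parallel. The abstract does not specify the unmasking rule, so the confidence-based selection here is an assumption. The names `MASK`, `toy_denoiser`, and `blockwise_masked_decode` are hypothetical, and the "denoiser" is a random stand-in for the dLLM backbone.

```python
import random

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def toy_denoiser(tokens, vocab_size, rng):
    """Random stand-in for the dLLM backbone.

    Returns {position: (predicted_token, confidence)} for every masked position.
    A real model would condition on the whole sequence, including the prefix.
    """
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def blockwise_masked_decode(prefix, num_blocks, block_size, steps_per_block,
                            vocab_size=16, seed=0):
    """Generate num_blocks blocks left to right, denoising each block in a few steps."""
    rng = random.Random(seed)
    seq = list(prefix)
    # Ceiling division: tokens committed per step so the block is done in time.
    per_step = -(-block_size // steps_per_block)
    for _ in range(num_blocks):
        seq.extend([MASK] * block_size)  # the new block starts fully masked
        for _ in range(steps_per_block):
            preds = toy_denoiser(seq, vocab_size, rng)
            if not preds:
                break  # block already fully unmasked
            # Commit the most confident predictions in parallel; the rest stay masked.
            best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:per_step]
            for i in best:
                seq[i] = preds[i][0]
    return seq


if __name__ == "__main__":
    out = blockwise_masked_decode(prefix=[1, 2, 3], num_blocks=2,
                                  block_size=4, steps_per_block=2)
    print(out)
```

In this sketch, earlier blocks are never revisited once committed, so they act as a fixed prefix for later blocks. That is the kind of structure a prefix-aware optimization could exploit, for example by reusing computation over the prefix, although the abstract does not describe how the optimization is actually implemented.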