Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing multimodal foundation models in task coverage breadth and sampling efficiency. We propose the first fully discrete diffusion architecture, unifying text-to-image generation, image-to-image translation (including editing, subject-driven generation, and inpainting), and visual understanding within a shared discrete latent space—thereby departing from conventional autoregressive or continuous hybrid paradigms. This design enhances both sampling speed and modeling consistency across tasks. Evaluated on multiple multimodal benchmarks, our model achieves state-of-the-art performance among open-source foundation models. Crucially, the architecture enables coherent joint modeling of generation and understanding without task-specific heads or modality-specific adaptations. All model weights and training/inference code are publicly released to foster community advancement in efficient, unified multimodal foundation models.

📝 Abstract
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency than previous autoregressive (AR) or hybrid AR-Diffusion paradigms while supporting a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
Problem

Research questions and friction points this paper is trying to address.

Handling multi-modal inputs and outputs seamlessly within a single model
Achieving higher sampling efficiency than previous AR and hybrid AR-Diffusion paradigms
Supporting a broad spectrum of generation and understanding tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses fully discrete diffusion modeling for multi-modal tasks
Achieves higher sampling efficiency than autoregressive methods
Supports text-to-image generation, image-to-image generation, and image understanding
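To make the "fully discrete diffusion" idea concrete, the sketch below shows a generic masked discrete-diffusion sampling loop: start from an all-[MASK] token sequence and, over a few parallel refinement passes, decode the model's most confident positions each step. This is an illustration of the general technique, not Lumina-DiMOO's actual implementation; the toy denoiser, vocabulary size, and schedule are all hypothetical stand-ins.

```python
import numpy as np

# Generic masked discrete-diffusion sampling sketch (illustrative only;
# `toy_denoiser`, VOCAB, MASK, and the unmasking schedule are hypothetical).

VOCAB = 16          # size of the discrete token vocabulary
MASK = VOCAB        # reserved id for the [MASK] token
SEQ_LEN = 8

rng = np.random.default_rng(0)

def toy_denoiser(tokens):
    """Stand-in for the learned model: returns per-position logits over the
    vocabulary. A real model would condition on the text/image prompt."""
    return rng.normal(size=(len(tokens), VOCAB))

def sample(steps=4):
    # Start from a fully masked sequence.
    tokens = np.full(SEQ_LEN, MASK)
    for step in range(steps):
        logits = toy_denoiser(tokens)
        # Softmax over the vocabulary at each position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        conf = probs.max(axis=-1)      # model confidence per position
        pred = probs.argmax(axis=-1)   # greedy prediction per position
        # Linear schedule: how many positions must remain masked after this step.
        n_keep_masked = SEQ_LEN - SEQ_LEN * (step + 1) // steps
        still_masked = tokens == MASK
        # Only still-masked positions are candidates for unmasking.
        conf = np.where(still_masked, conf, -np.inf)
        order = np.argsort(-conf)
        reveal = order[: still_masked.sum() - n_keep_masked]
        tokens[reveal] = pred[reveal]
    return tokens

print(sample())
```

Because many positions are decoded in parallel per pass, the number of model calls equals `steps` rather than the sequence length, which is the source of the sampling-efficiency advantage over token-by-token autoregressive decoding described above.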
👥 Authors

Yi Xin
California Institute of Technology
Industrial Organization, Econometrics

Qi Qin
Shanghai AI Laboratory, The University of Sydney

Siqi Luo
Shanghai Jiao Tong University
AIGC, Computer Vision, Image Editing, AI4Science

Kaiwen Zhu
Shanghai Jiao Tong University
Multi-Modal Generation, Computer Vision

Juncheng Yan
Shanghai AI Laboratory, Tsinghua University

Yan Tai
Shanghai Jiao Tong University

Jiayi Lei
Shanghai AI Laboratory, Shanghai Jiao Tong University

Yuewen Cao
The Chinese University of Hong Kong

Keqi Wang
Shanghai AI Laboratory

Yibin Wang
Intern at UIUC
Trustworthy AI

Jinbin Bai
National University of Singapore
Machine Learning, Content Creation, Generative Modeling

Qian Yu
Professor, Dept of Earth, Geographic, and Climate Sciences, University of Massachusetts-Amherst
GIS, Remote Sensing, Spatial Modeling

Dengyang Jiang
Northwestern Polytechnical University
Computer Vision, Deep Learning, Machine Learning

Yuandong Pu
SJTU, Shanghai AI Laboratory
Computer Vision

Haoxing Chen
Nanjing University

Le Zhuo
Krea AI
Generative Models, Multi-Modal Learning

Junjun He
Shanghai Jiao Tong University

Gen Luo
Shanghai AI Laboratory
Computer Vision, Vision and Language

Tianbin Li
Shanghai Artificial Intelligence Laboratory
Machine Learning, Computer Vision, General Intelligence

Ming Hu
Shanghai AI Laboratory

Jin Ye
Shanghai AI Laboratory

Shenglong Ye
Shanghai AI Laboratory

Bo Zhang
Shanghai AI Laboratory

Chang Xu
The University of Sydney

Wenhai Wang
Shanghai AI Laboratory