🤖 AI Summary
This work addresses the limitations of existing multimodal foundation models in breadth of task coverage and sampling efficiency. We propose the first fully discrete diffusion architecture, unifying text-to-image generation, image-to-image translation (including editing, subject-driven generation, and inpainting), and visual understanding within a shared discrete latent space, thereby departing from conventional autoregressive or continuous hybrid paradigms. This design improves both sampling speed and modeling consistency across tasks. Evaluated on multiple multimodal benchmarks, our model achieves state-of-the-art performance among open-source foundation models. Crucially, the architecture enables coherent joint modeling of generation and understanding without task-specific heads or modality-specific adaptations. All model weights and training/inference code are publicly released to foster community advancement in efficient, unified multimodal foundation models.
📝 Abstract
We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency than previous autoregressive (AR) or hybrid AR-Diffusion paradigms and to support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
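To make the discrete diffusion paradigm concrete, the sketch below illustrates the general idea behind iterative parallel unmasking over a discrete token space: start from a fully masked sequence and, at each step, commit the model's most confident predictions according to a schedule, so the whole sequence is resolved in a small, fixed number of steps rather than one token at a time as in AR decoding. This is a minimal, self-contained illustration of the generic technique, not Lumina-DiMOO's actual implementation; the `toy_denoiser`, the cosine schedule, and all names here are hypothetical stand-ins.

```python
import math
import random

MASK = -1  # hypothetical id for the special mask token


def toy_denoiser(tokens, vocab_size=16):
    """Stand-in for the learned model: for every masked position,
    return a (predicted token, confidence) pair. Here the predictions
    are random, purely for illustration."""
    return {
        i: (random.randrange(vocab_size), random.random())
        for i, t in enumerate(tokens)
        if t == MASK
    }


def discrete_diffusion_sample(seq_len, steps, denoiser=toy_denoiser, seed=0):
    """Generic parallel-unmasking sampler (sketch): begin fully masked,
    then at each step reveal the most confident predictions so that the
    remaining masked fraction follows a cosine schedule."""
    random.seed(seed)
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = denoiser(tokens)
        if not preds:
            break
        # Fraction of the sequence still masked after this step;
        # reaches 0 at the final step, so all tokens get revealed.
        frac_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        n_keep_masked = int(frac_masked * seq_len)
        n_reveal = max(1, len(preds) - n_keep_masked)
        # Commit the highest-confidence predictions this step.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for pos, (tok, _conf) in ranked[:n_reveal]:
            tokens[pos] = tok
    return tokens
```

Because many positions are committed per step, the number of forward passes is fixed by the schedule (here `steps`) instead of growing with sequence length, which is the intuition behind the sampling-efficiency advantage over AR decoding claimed above.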