Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

📅 2025-09-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing unified multimodal diffusion models support only simple image-level understanding and low-resolution generation, and lack explicit task planning, which hinders object localization, fine-grained editing, and high-resolution synthesis. To address these gaps, we propose the first unified masked diffusion model that tightly co-optimizes understanding and generation. Our approach introduces a novel planning and iterative self-reflection mechanism that dynamically refines generation or editing based on visual comprehension. We further design an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling to jointly improve training efficiency and inference speed. Extensive evaluations on the RefCOCO (localization), GenEval (generation), and ImgEdit (editing) benchmarks demonstrate state-of-the-art performance, consistently outperforming Qwen2.5-VL and FluxKontext-dev. Notably, our method achieves significant inference acceleration without compromising quality.

📝 Abstract
We propose Lavida-O, a unified multimodal Masked Diffusion Model (MDM) capable of both image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDa and Muddit, which support only simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To enable effective and efficient training and sampling, Lavida-O introduces several novel techniques, including an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks, such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering a considerable speedup at inference.
Problem

Research questions and friction points this paper is trying to address.

Unifying multimodal understanding and generation in a single model
Overcoming limitations of existing models in resolution and task variety
Enhancing image generation through understanding-based planning and reflection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Mixture-of-Transformer architecture for multimodal tasks
Stratified sampling technique for efficient training and inference
Iterative self-reflection using understanding to improve generation
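The paper itself does not spell out its sampler here, but masked diffusion decoding generally works by starting from a fully masked sequence and unmasking the most confident predictions over a fixed number of steps. A minimal, hypothetical sketch (all names and the confidence schedule are illustrative assumptions, not Lavida-O's actual implementation):

```python
MASK = -1  # hypothetical mask-token id


def toy_mdm_sample(seq_len, steps, predict):
    """Toy masked-diffusion decoding loop (illustrative only).

    `predict(seq, i)` stands in for the network: it returns a
    (token, confidence) proposal for masked position i. Each step,
    the most confident proposals are committed; the rest stay
    masked and are re-predicted next round.
    """
    seq = [MASK] * seq_len
    for step in range(steps):
        # Propose a token for every still-masked position.
        proposals = {i: predict(seq, i)
                     for i, tok in enumerate(seq) if tok == MASK}
        if not proposals:
            break
        # Linear unmasking schedule: by step t, a fraction (t+1)/steps
        # of all positions should be filled in.
        target_filled = round(seq_len * (step + 1) / steps)
        already_filled = seq_len - len(proposals)
        k = max(1, target_filled - already_filled)
        # Commit the k most confident proposals.
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (token, _conf) in best:
            seq[i] = token
    return seq


def demo():
    # Deterministic stand-in predictor: token = position index,
    # with earlier positions more confident.
    def predict(seq, i):
        return i, 1.0 / (i + 1)
    return toy_mdm_sample(8, 4, predict)
```

In a unified model like the one described, the same backbone that scores image understanding can, in principle, critique a partially decoded result between such steps (the self-reflection idea), though the exact mechanism is specified in the paper, not in this sketch.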