🤖 AI Summary
Existing unified multimodal diffusion models support only simple image-level understanding and low-resolution generation, and lack explicit task planning, which hinders object grounding, fine-grained editing, and high-resolution synthesis. To address these challenges, we propose Lavida-O, the first unified masked diffusion model that uses its understanding capabilities to improve image generation and editing through planning and iterative self-reflection, dynamically refining outputs based on visual comprehension. We further design an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling to jointly improve training efficiency and inference speed. Extensive evaluations on the RefCOCO (object grounding), GenEval (text-to-image generation), and ImgEdit (image editing) benchmarks demonstrate state-of-the-art performance, outperforming strong autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable inference speedup without compromising quality.
📝 Abstract
We propose Lavida-O, a unified multi-modal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffusion language models such as MMaDA and Muddit, which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To enable effective and efficient training and sampling, Lavida-O introduces several novel techniques such as an Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.