🤖 AI Summary
This work addresses the challenges of unified modeling in multimodal understanding and generation, particularly the difficulty of joint representation learning and the limitation of fixed output lengths. To this end, the authors propose a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text from continuous diffusion for vision while sharing a common attention backbone to enable efficient multimodal fusion. Furthermore, they introduce a data-driven, length-adaptive mechanism that requires no architectural modifications, enabling flexible, variable-length multimodal generation within a unified diffusion model for the first time. The proposed method achieves state-of-the-art performance on multimodal benchmarks, attaining a text-to-image generation score of 87.04 on DPG-Bench.
📝 Abstract
We present **LLaDA-o**, an effective, length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that reduces redundant computation on fixed conditions. Building on MoD, we further introduce a data-centric length-adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, reaching 87.04 on DPG-Bench for text-to-image generation and supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
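The MoD idea described above pairs two objectives: a discrete masked-diffusion loss on text tokens and a continuous noise-prediction loss on image latents, with both modalities routed through the same shared parameters. The toy NumPy sketch below illustrates one such combined training step under illustrative assumptions; all dimensions, the noise schedule, the `MASK_ID` token, and the one-layer "backbone" are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not from the paper).
VOCAB, MASK_ID, D = 100, 99, 16   # vocab, reserved mask token, hidden dim
T_TEXT, T_IMG = 8, 4              # text tokens, image latent "patches"

def shared_backbone(h):
    """Stand-in for the shared attention backbone: one linear map.
    Both modalities pass through the same weights -- the 'coupling'."""
    W = np.full((D, D), 0.01)     # placeholder shared parameters
    return h @ W

t = rng.uniform()                 # one diffusion time shared by both losses

# --- Discrete masked diffusion for text ---------------------------------
text = rng.integers(0, VOCAB - 1, size=T_TEXT)
mask = rng.uniform(size=T_TEXT) < t          # corruption: mask ratio ~ t
corrupted = np.where(mask, MASK_ID, text)

emb = rng.normal(size=(VOCAB, D))            # toy embedding table
h_text = shared_backbone(emb[corrupted])
logits = h_text @ emb.T                      # predict the original tokens
log_p = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
# cross-entropy on masked positions only (the masked-diffusion objective)
text_loss = -log_p[np.arange(T_TEXT), text][mask].mean() if mask.any() else 0.0

# --- Continuous diffusion for image latents -----------------------------
x0 = rng.normal(size=(T_IMG, D))             # clean image latents
eps = rng.normal(size=x0.shape)
alpha = 1.0 - t                              # toy variance schedule
xt = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps
eps_hat = shared_backbone(xt)                # same backbone predicts noise
img_loss = ((eps_hat - eps) ** 2).mean()     # standard eps-prediction loss

total_loss = text_loss + img_loss
```

In a real model the two losses would be backpropagated jointly so the shared backbone learns both tasks; the sketch only shows how the discrete and continuous corruption processes can coexist over one set of weights.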