LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

📅 2026-03-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of unified modeling in multimodal understanding and generation, particularly the difficulty of joint representation learning and the limitation of fixed output lengths. To this end, the authors propose a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text from continuous diffusion for vision while sharing a common attention backbone to enable efficient multimodal fusion. Furthermore, they introduce a data-driven, length-adaptive mechanism that requires no architectural modifications, enabling flexible, variable-length multimodal generation within a unified diffusion model for the first time. The proposed method achieves state-of-the-art performance on multimodal benchmarks, attaining a text-to-image generation score of 87.04 on DPG-Bench.

📝 Abstract
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding and continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
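The MoD decoupling described in the abstract, a shared attention backbone whose output feeds a discrete head (masked-token prediction for text) and a continuous head (noise prediction for image latents), can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation; all class names, dimensions, and the single-layer attention are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MoDSketch:
    """Illustrative Mixture-of-Diffusion step (not the authors' code).

    Text tokens and image latents are embedded into one sequence and
    passed through one shared self-attention layer (the "shared
    backbone"); the sequence is then split, with a discrete head
    producing token logits for masked text positions and a continuous
    head predicting noise on the image latents.
    """

    def __init__(self, vocab=1000, d=32, latent=8, seed=0):
        rng = np.random.default_rng(seed)
        self.mask_id = vocab                            # extra [MASK] token id
        self.E = rng.normal(0, 0.02, (vocab + 1, d))    # text embedding table
        self.P = rng.normal(0, 0.02, (latent, d))       # image-latent projection
        self.Wq = rng.normal(0, 0.02, (d, d))
        self.Wk = rng.normal(0, 0.02, (d, d))
        self.Wv = rng.normal(0, 0.02, (d, d))
        self.W_text = rng.normal(0, 0.02, (d, vocab))   # discrete head: logits
        self.W_img = rng.normal(0, 0.02, (d, latent))   # continuous head: noise
        self.d = d

    def forward(self, text_ids, img_latents):
        # Fuse both modalities into one sequence for the shared backbone.
        h = np.concatenate([self.E[text_ids], img_latents @ self.P], axis=0)
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        h = softmax(q @ k.T / np.sqrt(self.d)) @ v      # shared attention
        t = len(text_ids)                               # split back by modality
        return h[:t] @ self.W_text, h[t:] @ self.W_img

model = MoDSketch()
text_ids = np.array([model.mask_id, 3, model.mask_id])  # partially masked text
img_latents = np.random.default_rng(1).normal(size=(4, 8))
logits, noise = model.forward(text_ids, img_latents)
# logits has shape (3, 1000): per-position token predictions for the text;
# noise has shape (4, 8): a noise estimate per image latent.
```

The point of the sketch is the paper's structural claim: the two diffusion processes remain decoupled at the heads (categorical for text, continuous for vision) while fusion happens only in the shared attention over the concatenated sequence.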
Problem

Research questions and friction points this paper addresses.

multimodal understanding
visual generation
length-adaptive decoding
omni diffusion model
text-to-image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Diffusion
length-adaptive decoding
omni diffusion model
multimodal understanding and generation
shared attention backbone