🤖 AI Summary
This work addresses the challenges of unified modeling in multimodal understanding and generation, particularly the difficulty of joint representation learning and the limitation of fixed output lengths. To this end, the authors propose a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text from continuous diffusion for vision while sharing a common attention backbone to enable efficient multimodal fusion. Furthermore, they introduce a data-driven, length-adaptive mechanism that requires no architectural modifications, enabling flexible, variable-length multimodal generation within a unified diffusion model for the first time. The proposed method achieves state-of-the-art performance on multimodal benchmarks, attaining a text-to-image generation score of 87.04 on DPG-Bench.
📝 Abstract
We present **LLaDA-o**, an effective, length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that reduces redundant computation on fixed conditions. Building on MoD, we further introduce a data-centric length-adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, reaching 87.04 on DPG-Bench for text-to-image generation and supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
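The MoD idea described above pairs two objectives: a discrete masked-diffusion loss on text tokens and a continuous noise-prediction loss on image latents, with both modalities routed through the same shared parameters. The toy NumPy sketch below illustrates one such combined training step under illustrative assumptions; all dimensions, the noise schedule, the `MASK_ID` token, and the one-layer "backbone" are hypothetical stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (not from the paper).
VOCAB, MASK_ID, D = 100, 99, 16   # vocab, reserved mask token, hidden dim
T_TEXT, T_IMG = 8, 4              # text tokens, image latent "patches"

def shared_backbone(h):
    """Stand-in for the shared attention backbone: one linear map.
    Both modalities pass through the same weights -- the 'coupling'."""
    W = np.full((D, D), 0.01)     # placeholder shared parameters
    return h @ W

t = rng.uniform()                 # one diffusion time shared by both losses

# --- Discrete masked diffusion for text ---------------------------------
text = rng.integers(0, VOCAB - 1, size=T_TEXT)
mask = rng.uniform(size=T_TEXT) < t          # corruption: mask ratio ~ t
corrupted = np.where(mask, MASK_ID, text)

emb = rng.normal(size=(VOCAB, D))            # toy embedding table
h_text = shared_backbone(emb[corrupted])
logits = h_text @ emb.T                      # predict the original tokens
log_p = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
# cross-entropy on masked positions only (the masked-diffusion objective)
text_loss = -log_p[np.arange(T_TEXT), text][mask].mean() if mask.any() else 0.0

# --- Continuous diffusion for image latents -----------------------------
x0 = rng.normal(size=(T_IMG, D))             # clean image latents
eps = rng.normal(size=x0.shape)
alpha = 1.0 - t                              # toy variance schedule
xt = np.sqrt(alpha) * x0 + np.sqrt(1.0 - alpha) * eps
eps_hat = shared_backbone(xt)                # same backbone predicts noise
img_loss = ((eps_hat - eps) ** 2).mean()     # standard eps-prediction loss

total_loss = text_loss + img_loss
```

In a real model the two losses would be backpropagated jointly so the shared backbone learns both tasks; the sketch only shows how the discrete and continuous corruption processes can coexist over one set of weights.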