Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes the first any-to-any multimodal language model based on a masked discrete diffusion framework, addressing the limitation that prevailing autoregressive architectures struggle to unify and efficiently handle both understanding and generation tasks across modalities. By employing a unified architecture to jointly model the discrete token distributions of text, speech, and images, the model supports not only pairwise but also more complex multimodal interactions. This approach sidesteps constraints inherent in the autoregressive paradigm and achieves competitive or superior performance against existing systems across multiple multimodal benchmarks, demonstrating the potential of diffusion models as a foundational architecture for next-generation multimodal AI.

📝 Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, which unifies understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
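
The abstract describes the mechanism only at a high level. As a rough illustration, mask-based discrete diffusion typically trains by replacing a random fraction of a token sequence with a special mask token and learning to recover the originals via cross-entropy on the masked positions. The sketch below is a minimal, generic version of that objective, not Omni-Diffusion's actual implementation; `model`, `mask_id`, and the 1/t loss weighting are assumptions borrowed from standard masked diffusion formulations.

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(model, tokens, mask_id, eps=1e-3):
    """One training step of mask-based discrete diffusion (generic sketch).

    tokens: (batch, seq_len) discrete token ids; text, speech, and image
    tokens are assumed here to share a single unified vocabulary.
    """
    b, n = tokens.shape
    # Sample a corruption level t in (eps, 1] per sequence: near t = 1
    # almost every token is masked, near t = 0 almost none are.
    t = torch.rand(b, device=tokens.device) * (1 - eps) + eps
    # Mask each position independently with probability t.
    is_masked = torch.rand(b, n, device=tokens.device) < t[:, None]
    noisy = torch.where(is_masked, torch.full_like(tokens, mask_id), tokens)
    # The denoiser predicts the original token at every position.
    logits = model(noisy)  # (batch, seq_len, vocab_size)
    # Cross-entropy on masked positions only, with the 1/t reweighting
    # used by standard masked discrete diffusion objectives.
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens.reshape(-1),
        reduction="none",
    ).view(b, n)
    loss = (ce * is_masked) / t[:, None]
    return loss.sum() / is_masked.sum().clamp(min=1)
```

In a unified any-to-any setting, `tokens` would presumably be the concatenation of tokenized inputs and targets across modalities, with masking restricted to the target span when training a conditional task; the paper does not spell this detail out here.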
Problem

Research questions and friction points this paper is trying to address.

multimodal
diffusion models
autoregressive architecture
unified understanding and generation
discrete tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

masked discrete diffusion
multimodal foundation model
any-to-any generation
unified multimodal understanding
diffusion-based architecture
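
In masked diffusion models, the "any-to-any generation" named in the tags above usually means decoding by iterative unmasking rather than left-to-right sampling: the target span starts fully masked, and the most confident predictions are committed over a fixed number of steps. The following is an illustrative, generic sampler under those assumptions; the step schedule, confidence heuristic, and function names are hypothetical and not taken from the paper.

```python
import torch

@torch.no_grad()
def unmask_generate(model, prompt, gen_len, mask_id, steps=16):
    """Generic iterative-unmasking sampler for a mask-based discrete
    diffusion model (illustrative sketch, not the paper's sampler).

    prompt: (batch, prompt_len) condition tokens from the input
    modalities; the generated span can be any target modality.
    """
    out = torch.full(
        (prompt.size(0), gen_len), mask_id,
        device=prompt.device, dtype=prompt.dtype,
    )
    for step in range(steps):
        still_masked = out == mask_id
        remaining = int(still_masked.sum(dim=1).min())
        if remaining == 0:
            break
        seq = torch.cat([prompt, out], dim=1)
        logits = model(seq)[:, prompt.size(1):]       # (batch, gen_len, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)  # only fill masked slots
        # Commit the most confident positions, spreading the work over the
        # remaining steps so everything is unmasked by the final one.
        k = min(remaining, -(-remaining // (steps - step)))  # ceil division
        idx = conf.topk(k, dim=1).indices
        out.scatter_(1, idx, pred.gather(1, idx))
    return out
```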