🤖 AI Summary
This survey addresses the gap between understanding and generation capabilities in multi-modal generative AI. It reviews the two dominant technique families, multi-modal large language models (MLLMs) for understanding and diffusion models for generation, covering their probabilistic modeling procedures, multi-modal architecture designs, and applications to image/video understanding and text-to-image/video generation. It then analyzes two central design questions for a unified model: whether to adopt autoregressive or diffusion probabilistic modeling, and whether to use a dense or Mixture-of-Experts (MoE) architecture. Finally, it proposes several candidate strategies for building a unified model, weighs their advantages and disadvantages, surveys existing large-scale multi-modal pretraining datasets, and outlines challenging future directions, offering both conceptual foundations and practical design guidance for unified multi-modal generative AI systems.
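As a concrete reference for the autoregressive-versus-diffusion trade-off mentioned above, the two paradigms' standard training objectives can be contrasted as follows (a textbook-style illustration, not equations taken from this paper):

```latex
% Autoregressive factorization: the joint distribution over tokens x_1..x_T
% is decomposed into next-token conditionals, trained by maximum likelihood.
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Denoising diffusion: a network \epsilon_\theta is trained to predict the
% noise added to a clean sample x_0 at a random timestep t (the standard
% DDPM "simple" objective, with noise schedule \bar{\alpha}_t).
\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0
    + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\right) \right\|^2 \right]
```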
📝 Abstract
Multi-modal generative AI has received increasing attention in both academia and industry. In particular, two families of techniques dominate: i) multi-modal large language models (MLLMs) such as GPT-4V, which show impressive abilities in multi-modal understanding; and ii) diffusion models such as Sora, which exhibit remarkable capabilities in multi-modal generation, especially visual generation. A natural question therefore arises: is it possible to have a unified model for both understanding and generation? To answer this question, in this paper we first provide a detailed review of both MLLMs and diffusion models, including their probabilistic modeling procedures, multi-modal architecture designs, and advanced applications to image/video large language models as well as text-to-image/video generation. We then discuss two important questions about a unified model: i) whether it should adopt autoregressive or diffusion probabilistic modeling, and ii) whether it should use a dense architecture or a Mixture-of-Experts (MoE) architecture to better support the two objectives of understanding and generation. We further present several possible strategies for building a unified model and analyze their potential advantages and disadvantages. We also summarize existing large-scale multi-modal datasets to support future model pretraining. To conclude, we outline several challenging future directions that we believe can contribute to the ongoing advancement of multi-modal generative AI.
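To make the dense-versus-MoE question concrete, below is a minimal PyTorch sketch (all module names and dimensions are illustrative assumptions, not taken from the paper) contrasting a dense feed-forward block, where every token activates all parameters, with a top-k routed MoE block, where each token activates only a small subset of experts:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Dense feed-forward block: every token activates all parameters."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))

class TopKMoE(nn.Module):
    """Sparse MoE block: a router sends each token to its top-k experts, so
    only a fraction of the parameters is active per token. In a unified model,
    separate experts could in principle specialize toward understanding-style
    or generation-style computation (an assumption, not the paper's claim)."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_hidden) for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); flatten batch/sequence dims before calling.
        logits = self.router(x)                      # (num_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)   # top-k experts per token
        weights = weights.softmax(dim=-1)            # normalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Find which tokens routed to expert e, and at which top-k slot.
            token_ids, slots = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            gate = weights[token_ids, slots].unsqueeze(-1)  # (n_selected, 1)
            out[token_ids] += gate * expert(x[token_ids])
        return out

# Toy comparison: same input, dense vs. sparse compute path.
tokens = torch.randn(16, 64)                   # 16 tokens, d_model = 64
dense, moe = DenseFFN(64, 256), TopKMoE(64, 256)
print(dense(tokens).shape, moe(tokens).shape)  # both: torch.Size([16, 64])
```

The sketch uses a simple token-level top-2 router; production MoE systems typically add load-balancing losses and expert capacity limits, which are omitted here for brevity.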