🤖 AI Summary
Existing autoregressive (AR) multimodal large language models (MLLMs) are constrained by causal modeling and raster-scan generation, which hinders simultaneous high-fidelity visual understanding and image synthesis. This paper introduces the first purely discrete flow-matching-based unified multimodal model, abandoning the AR paradigm in favor of bidirectional context modeling and joint understanding and generation. Key contributions include: (1) a metric-induced probability path derived from optimal-transport dynamics that enables self-correcting iterative generation, and (2) an adaptation mechanism for transferring pre-trained AR models to the new paradigm, coupled with a test-time scaling strategy. Experiments show that the model matches state-of-the-art (SOTA) AR-based MLLMs on both visual understanding and image generation benchmarks. With test-time scaling, performance improves significantly across tasks, and the model's compatibility with reinforcement learning, validated through policy-optimization experiments, confirms its scalability for downstream interactive applications.
📝 Abstract
The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.
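To make the contrast with raster-scan AR decoding concrete, the following toy sketch illustrates the general shape of a discrete flow-matching sampler: the sequence starts as pure noise, and at every step *all* positions may be resampled toward the model's predicted clean distribution, so earlier tokens can be revised later (self-correction). This is not FUDOKI's actual algorithm; the `dfm_sample` function, the jump schedule, and the `dummy_model` are illustrative assumptions standing in for the paper's metric-induced probability path and learned denoiser.

```python
import numpy as np

def dfm_sample(model, seq_len, vocab_size, steps=16, rng=None):
    """Toy discrete flow-matching-style sampler (illustrative only).

    Starts from uniform noise and iteratively moves each token toward the
    model's predicted clean distribution. Unlike left-to-right AR decoding,
    every position may be revised at every step (self-correction).
    """
    rng = rng or np.random.default_rng(0)
    x = rng.integers(0, vocab_size, size=seq_len)  # x_0 ~ uniform noise
    for i in range(steps):
        t = i / steps                              # time in [0, 1)
        probs = model(x, t)                        # (seq_len, vocab) predicted p(x_1 | x_t, t)
        # Each position jumps toward the target with growing probability;
        # this 1/(steps - i) schedule is a crude stand-in for the paper's
        # kinetic-optimal velocity field, not its actual form.
        jump = rng.random(seq_len) < (1.0 / (steps - i))
        new_tokens = np.array([rng.choice(vocab_size, p=p) for p in probs])
        x = np.where(jump, new_tokens, x)
    return x

# Hypothetical "model": always predicts a one-hot distribution on token 3,
# so a correct sampler should converge every position to token 3.
def dummy_model(x, t):
    p = np.zeros((len(x), 8))
    p[:, 3] = 1.0
    return p

out = dfm_sample(dummy_model, seq_len=5, vocab_size=8)
```

With this degenerate model the sampler drives every position to token 3 by the final step, regardless of the noisy initialization; in the real model, the predicted distribution depends on the current sequence and timestep, which is what allows bidirectional context to reshape earlier choices.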