🤖 AI Summary
Existing autoregressive (AR) multimodal large language models (MLLMs) are constrained by causal modeling and raster-scan generation, which hinders simultaneous high-fidelity visual understanding and image synthesis. This paper introduces the first purely discrete flow-matching-based unified multimodal model, abandoning the AR paradigm in favor of bidirectional context modeling and joint understanding and generation. Key contributions include: (1) a metric-induced probability path derived from optimal-transport dynamics that enables self-correcting iterative generation, and (2) an adaptation mechanism for transferring pre-trained AR models to the new paradigm, coupled with a test-time scaling strategy. Experiments show that the model matches state-of-the-art (SOTA) AR-based MLLMs on both visual understanding and image generation benchmarks. With test-time scaling, performance improves significantly across tasks, and the model's compatibility with reinforcement learning, validated through policy-optimization experiments, confirms its scalability for downstream interactive applications.
📝 Abstract
The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.
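To make the contrast with raster-scan AR decoding concrete, the following toy sketch illustrates the general shape of a discrete flow-matching sampler: the sequence starts as pure noise, and at every step *all* positions may be resampled toward the model's predicted clean distribution, so earlier tokens can be revised later (self-correction). This is not FUDOKI's actual algorithm; the `dfm_sample` function, the jump schedule, and the `dummy_model` are illustrative assumptions standing in for the paper's metric-induced probability path and learned denoiser.

```python
import numpy as np

def dfm_sample(model, seq_len, vocab_size, steps=16, rng=None):
    """Toy discrete flow-matching-style sampler (illustrative only).

    Starts from uniform noise and iteratively moves each token toward the
    model's predicted clean distribution. Unlike left-to-right AR decoding,
    every position may be revised at every step (self-correction).
    """
    rng = rng or np.random.default_rng(0)
    x = rng.integers(0, vocab_size, size=seq_len)  # x_0 ~ uniform noise
    for i in range(steps):
        t = i / steps                              # time in [0, 1)
        probs = model(x, t)                        # (seq_len, vocab) predicted p(x_1 | x_t, t)
        # Each position jumps toward the target with growing probability;
        # this 1/(steps - i) schedule is a crude stand-in for the paper's
        # kinetic-optimal velocity field, not its actual form.
        jump = rng.random(seq_len) < (1.0 / (steps - i))
        new_tokens = np.array([rng.choice(vocab_size, p=p) for p in probs])
        x = np.where(jump, new_tokens, x)
    return x

# Hypothetical "model": always predicts a one-hot distribution on token 3,
# so a correct sampler should converge every position to token 3.
def dummy_model(x, t):
    p = np.zeros((len(x), 8))
    p[:, 3] = 1.0
    return p

out = dfm_sample(dummy_model, seq_len=5, vocab_size=8)
```

With this degenerate model the sampler drives every position to token 3 by the final step, regardless of the noisy initialization; in the real model, the predicted distribution depends on the current sequence and timestep, which is what allows bidirectional context to reshape earlier choices.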