dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reinforcement learning approaches do not fully exploit transition rates and posterior information when optimizing general discrete flow models. This work addresses that limitation by modeling the denoising process as a Markov decision process and introducing dFlowGRPO, a unified reinforcement learning framework that extends rate-based policy optimization, previously limited to specific settings, to general discrete flow models. By combining conditional transition rates with posterior information, dFlowGRPO supports a broad family of probability paths and non-masked source distributions. Empirical results show that the method outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and is competitive with continuous flow-based models trained with FlowGRPO, while also showing strong results on multimodal understanding benchmarks.
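To make the "denoising as a Markov decision process" view concrete, the sketch below shows one Euler sampling step for a discrete flow and the log-probability of the transition it takes, which is the per-step quantity a policy-gradient method would need. It is an illustration under stated assumptions, not the paper's implementation: it assumes a mixture path with the linear scheduler kappa_t = t, for which the jump rate toward the predicted clean token is 1 / (1 - t), and the names `euler_step` and `x1_probs` are hypothetical.

```python
import math
import random


def euler_step(tokens, x1_probs, t, h, rng):
    """One Euler step of a mixture-path discrete flow (assumed kappa_t = t).

    tokens:   current token ids, one per position (the MDP state at time t)
    x1_probs: per-position posterior over clean tokens; a hypothetical
              stand-in for the output of a posterior model
    t, h:     current time in [0, 1) and step size
    Returns (next_tokens, log_prob), where log_prob is the log-probability
    of the sampled transition (the MDP action) summed over positions.
    """
    # Jump probability = step size * rate, with rate kappa'_t / (1 - kappa_t).
    jump = min(1.0, h / (1.0 - t))
    next_tokens, log_prob = [], 0.0
    for z, probs in zip(tokens, x1_probs):
        # Transition distribution: jump to a token drawn from the posterior
        # with probability `jump`, otherwise stay at the current token z.
        step_probs = [jump * p for p in probs]
        step_probs[z] += 1.0 - jump
        x = rng.choices(range(len(probs)), weights=step_probs)[0]
        next_tokens.append(x)
        log_prob += math.log(step_probs[x])
    return next_tokens, log_prob


# Usage: two positions, vocabulary of size 2, starting from arbitrary
# (non-masked) source tokens.
rng = random.Random(0)
state, logp = euler_step([0, 1], [[0.1, 0.9], [0.5, 0.5]], t=0.5, h=0.25, rng=rng)
```

Because the source tokens here are ordinary vocabulary entries rather than a mask symbol, a position can already hold its target token, which is why the "stay" mass is added to the posterior mass at `z` instead of being tracked separately.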

📝 Abstract

Discrete flow models (DFMs) are a flexible class of generative models for discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored applying reinforcement learning to dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.
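For readers unfamiliar with GRPO-type objectives, the following minimal sketch shows the two generic ingredients such methods share: advantages normalized within a group of samples drawn for the same prompt, and a PPO-style clipped surrogate on the probability ratio. This is the standard GRPO recipe, not the specific dFlowGRPO objective, which the abstract does not spell out. The function names, the use of the population standard deviation, and the clip value of 0.2 are assumptions made for illustration.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(new_logp, old_logp, advantage, clip=0.2):
    """PPO-style clipped objective for one transition (to be maximized).

    new_logp / old_logp are log-probabilities of the same sampled
    transition under the current and the sampling policy.
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped * advantage)


# Usage: four images generated for one prompt, scored by a reward model.
advs = group_advantages([0.9, 0.4, 0.4, 0.1])
objective = sum(clipped_surrogate(-1.0, -1.1, a) for a in advs) / len(advs)
```

In a discrete flow setting, the per-transition log-probabilities would come from the denoising steps of each sampled trajectory, so that the surrogate is averaged over both group members and time steps.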

Problem

Research questions and friction points this paper is trying to address.

Discrete Flow Models

Reinforcement Learning

Policy Optimization

dLLMs

Probability Paths

Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete flow models

reinforcement learning

rate-aware policy optimization