Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Flow Matching (FM) models suffer from excessive noise injection during stochastic differential equation (SDE)-based sampling in reinforcement learning (RL) optimization, leading to image artifacts that corrupt reward modeling. To address this, we propose Coefficients-Preserving Sampling (CPS), a theoretically grounded denoising sampling method. CPS preserves critical coefficients from the SDE discretization scheme while explicitly eliminating redundant stochasticity, thereby reconstructing cleaner generation trajectories. Inspired by DDIM and fully compatible with the FM framework, CPS supports end-to-end joint training with RL optimizers such as Flow-GRPO and Dance-GRPO. Experiments demonstrate that CPS substantially suppresses noise-induced artifacts in generated images, improves the fidelity and accuracy of reward signals, accelerates RL convergence, and enhances training stability.

📝 Abstract
Reinforcement Learning (RL) has recently emerged as a powerful technique for improving image and video generation with Diffusion and Flow Matching models, specifically for enhancing output quality and alignment with prompts. A critical step in applying online RL methods to Flow Matching is the introduction of stochasticity into the otherwise deterministic framework, commonly realized via stochastic differential equations (SDEs). Our investigation reveals a significant drawback of this approach: SDE-based sampling introduces pronounced noise artifacts into the generated images, which we found to be detrimental to the reward learning process. A rigorous theoretical analysis traces the origin of this noise to an excess of stochasticity injected during inference. To address this, we draw inspiration from Denoising Diffusion Implicit Models (DDIM) to reformulate the sampling process. Our proposed method, Coefficients-Preserving Sampling (CPS), eliminates these noise artifacts, leading to more accurate reward modeling and ultimately enabling faster and more stable convergence for RL-based optimizers such as Flow-GRPO and Dance-GRPO. Code will be released at https://github.com/IamCreateAI/FlowCPS
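To make the abstract's distinction concrete, the sketch below contrasts a stochastic (SDE-style) sampling step with a deterministic, DDIM-inspired step of the kind CPS describes. This is a minimal illustration, not the paper's implementation: it assumes the rectified-flow convention x_t = (1 - t)·x0 + t·ε with velocity v = ε - x0, and the function names (`velocity_to_endpoints`, `cps_step`, `sde_step`) and the noise scale `sigma` are hypothetical.

```python
import numpy as np

def velocity_to_endpoints(x_t, v, t):
    """Recover the clean-data and noise estimates from a velocity prediction.

    Assumes the rectified-flow convention x_t = (1 - t) * x0 + t * eps
    and v = eps - x0 (t = 0 is data, t = 1 is pure noise).
    """
    x0_hat = x_t - t * v
    eps_hat = x_t + (1.0 - t) * v
    return x0_hat, eps_hat

def cps_step(x_t, v, t, t_next):
    """Deterministic DDIM-style step: re-interpolate the two endpoint
    estimates at the next timestep, injecting no fresh noise."""
    x0_hat, eps_hat = velocity_to_endpoints(x_t, v, t)
    return (1.0 - t_next) * x0_hat + t_next * eps_hat

def sde_step(x_t, v, t, t_next, sigma, rng):
    """Illustrative stochastic variant: the same deterministic move plus
    injected Gaussian noise, as used to enable exploration in online RL.
    The noise schedule here is a placeholder, not the paper's."""
    dt = t - t_next
    x_det = cps_step(x_t, v, t, t_next)
    return x_det + sigma * np.sqrt(dt) * rng.standard_normal(x_t.shape)
```

With an exact velocity, `cps_step` lands exactly on the interpolation path at `t_next`, whereas `sde_step` deviates from it by the injected noise; the abstract's claim is that this extra stochasticity is what corrupts the reward signal.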
Problem

Research questions and friction points this paper is trying to address.

SDE-based sampling introduces noise artifacts into generated images
Excess stochasticity injected during inference corrupts the reward learning process
Noise must be eliminated to obtain accurate reward modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates the sampling process, drawing inspiration from DDIM
Eliminates noise artifacts in generated images
Enables accurate reward modeling for RL optimizers such as Flow-GRPO and Dance-GRPO