Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion policy optimization methods (e.g., DDPO) suffer from a misalignment between their reinforcement learning (RL) objectives and the pretraining score/flow matching objectives, leading to high gradient variance and slow convergence. This work introduces Advantage Weighted Matching (AWM), built on the first analysis showing that DDPO implicitly performs score matching against noisy targets. AWM unifies the policy gradient with the pretraining objective: it directly reuses the original score/flow matching loss and reweights each sample by its advantage. This design substantially reduces gradient variance and accelerates convergence while remaining consistent with policy-gradient theory. On Stable Diffusion 3.5 Medium and FLUX, AWM trains up to 24× faster than Flow-GRPO while preserving generation quality, attaining state-of-the-art performance across the GenEval, OCR, and PickScore benchmarks.

📝 Abstract
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objective, the score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
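The core idea in the abstract, keeping the pretraining score/flow matching loss and reweighting each sample by its advantage, can be sketched in a few lines. The following is a minimal NumPy illustration under stated assumptions (a per-sample squared flow-matching error and a batch-mean baseline for the advantage), not the paper's actual implementation:

```python
import numpy as np

def flow_matching_error(v_pred, v_target):
    # Per-sample squared error against the flow-matching target
    # (the same quantity minimized during pretraining).
    return ((v_pred - v_target) ** 2).mean(axis=-1)

def awm_loss(v_pred, v_target, rewards):
    # Advantage as reward minus a batch-mean baseline -- one common
    # choice, used here purely for illustration.
    advantage = rewards - rewards.mean()
    # AWM reweights the pretraining loss per sample: high-reward
    # samples pull the model toward them, low-reward samples push away.
    return (advantage * flow_matching_error(v_pred, v_target)).mean()

rng = np.random.default_rng(0)
v_pred = rng.normal(size=(4, 8))    # model velocity predictions
v_target = rng.normal(size=(4, 8))  # flow-matching targets
rewards = np.array([2.0, 1.0, 0.5, 0.0])
loss = awm_loss(v_pred, v_target, rewards)
```

Note that when all rewards are equal, the advantage vanishes and the gradient is zero, so only the relative ranking of samples within a batch drives the update.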
Problem

Research questions and friction points this paper is trying to address.

Aligns RL objectives with pretraining loss in diffusion models
Reduces variance and accelerates convergence in policy optimization
Reweights samples by advantage while maintaining pretraining consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

AWM reuses the pretraining score/flow matching loss, yielding lower gradient variance
It reweights each sample by its advantage, favoring high-reward outputs
Unifies pretraining and RL objectives, both conceptually and in practice