Understanding Sampler Stochasticity in Training Diffusion Models for RLHF

📅 2025-10-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reward gap arising from train-inference mismatch in diffusion models: stochastic sampling during RLHF fine-tuning versus deterministic sampling at inference. We provide the first theoretical characterization of this reward gap, deriving non-vacuous bounds and establishing sharper convergence rates under both the VE and VP SDE frameworks. To bridge the gap, we adopt the generalized DDIM (gDDIM) framework to support arbitrarily high stochasticity, enabling a unified training-inference paradigm across noise levels. Using DDPO and MixGRPO, our method achieves consistent optimization between SDE-based training and ODE-based inference. Extensive text-to-image experiments demonstrate that the reward gap steadily narrows during training, and that high-stochasticity SDE training significantly enhances generation quality and human preference scores under deterministic ODE inference.

📝 Abstract
Reinforcement Learning from Human Feedback (RLHF) is increasingly used to fine-tune diffusion models, but a key challenge arises from the mismatch between stochastic samplers used during training and deterministic samplers used during inference. In practice, models are fine-tuned using stochastic SDE samplers to encourage exploration, while inference typically relies on deterministic ODE samplers for efficiency and stability. This discrepancy induces a reward gap, raising concerns about whether high-quality outputs can be expected during inference. In this paper, we theoretically characterize this reward gap and provide non-vacuous bounds for general diffusion models, along with sharper convergence rates for Variance Exploding (VE) and Variance Preserving (VP) Gaussian models. Methodologically, we adopt the generalized denoising diffusion implicit models (gDDIM) framework to support arbitrarily high levels of stochasticity, preserving data marginals throughout. Empirically, our findings through large-scale experiments on text-to-image models using denoising diffusion policy optimization (DDPO) and mixed group relative policy optimization (MixGRPO) validate that reward gaps consistently narrow over training, and ODE sampling quality improves when models are updated using higher-stochasticity SDE training.
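The deterministic-versus-stochastic sampler distinction at the heart of the abstract can be made concrete with the standard DDIM stochasticity knob. Below is a minimal NumPy sketch of a DDIM-style update with parameter `eta`, not the paper's gDDIM implementation; `ab_t`/`ab_prev` denote cumulative alpha-bar values at the current and previous noise levels, and `eps_pred` is a stand-in for a learned noise predictor:

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev, eta, rng):
    """One DDIM-style update. eta=0 recovers the deterministic (ODE-like)
    sampler; eta=1 recovers DDPM-like stochastic (SDE-like) sampling."""
    # Predicted clean sample from the current noise estimate
    x0 = (x_t - np.sqrt(1.0 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # Per-step noise scale controlled by eta
    sigma = (eta * np.sqrt((1.0 - ab_prev) / (1.0 - ab_t))
                 * np.sqrt(1.0 - ab_t / ab_prev))
    # Deterministic drift toward the previous noise level, plus injected noise
    mean = np.sqrt(ab_prev) * x0 + np.sqrt(1.0 - ab_prev - sigma**2) * eps_pred
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x_t = rng.standard_normal(4)
eps = 0.1 * x_t  # illustrative stand-in for a trained noise predictor
det = ddim_step(x_t, eps, ab_t=0.5, ab_prev=0.8, eta=0.0, rng=rng)
sto = ddim_step(x_t, eps, ab_t=0.5, ab_prev=0.8, eta=1.0, rng=rng)
```

With `eta=0` the noise term vanishes and repeated calls are identical, matching the deterministic inference regime; raising `eta` injects per-step Gaussian noise, mimicking the exploratory stochastic sampling used during fine-tuning, which is exactly the mismatch the paper analyzes.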
Problem

Research questions and friction points this paper is trying to address.

Addresses mismatch between stochastic training and deterministic inference samplers
Characterizes reward gap induced by sampler discrepancy in diffusion models
Proposes framework to narrow reward gap through controlled stochasticity training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adopts the gDDIM framework to support arbitrarily high stochasticity while preserving data marginals
Applies DDPO and MixGRPO policy-optimization methods for fine-tuning
Derives non-vacuous reward-gap bounds, with sharper convergence rates for VE and VP Gaussian models
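The DDPO-style optimization referenced above treats the denoising chain as a policy and weights per-step log-probabilities by a terminal reward. A minimal REINFORCE-style sketch of that objective, assuming isotropic Gaussian transition densities (the trajectory, means, and reward here are illustrative placeholders, not the paper's actual setup):

```python
import numpy as np

def gaussian_logprob(x, mean, sigma):
    # Log-density of an isotropic Gaussian N(mean, sigma^2 I), summed over dims
    d = x.size
    return float(np.sum(-0.5 * ((x - mean) / sigma) ** 2)
                 - d * (np.log(sigma) + 0.5 * np.log(2.0 * np.pi)))

def reinforce_objective(trajectory, means, sigma, reward):
    """REINFORCE-style weight: terminal reward times the summed log-probability
    of each denoising transition under the current policy."""
    logp = sum(gaussian_logprob(x_next, m, sigma)
               for x_next, m in zip(trajectory, means))
    return reward * logp

rng = np.random.default_rng(0)
traj = [rng.standard_normal(3) for _ in range(4)]  # illustrative trajectory
means = [0.9 * x for x in traj]                    # illustrative policy means
obj = reinforce_objective(traj, means, sigma=0.1, reward=1.5)
```

In practice the reward gap studied in the paper arises because this objective is evaluated on stochastic trajectories, while inference draws from the deterministic sampler instead.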
🔎 Similar Papers
2022-09-02 · ACM Computing Surveys · Citations: 1628
2024-05-22 · Neural Information Processing Systems · Citations: 33