Reward Sharpness-Aware Fine-Tuning for Diffusion Models

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the reward hacking problem in diffusion models fine-tuned with reinforcement learning from human feedback (RLHF), where inflated reward scores fail to correspond to genuine improvements in generation quality. To mitigate this issue without retraining the reward model, the authors propose RSA-FT (Reward Sharpness-Aware Fine-Tuning), which jointly applies parameter and sample perturbations to obtain gradients from a smoother reward landscape, thereby robustifying the guidance the reward model provides. Central to this approach is a reward sharpness-aware optimization mechanism that suppresses reward hacking during diffusion model fine-tuning. Experiments show that each perturbation strategy independently improves generation quality and alignment stability, and that their combination amplifies these benefits, improving the reliability and practicality of reward-centric diffusion reinforcement learning (RDRL).

📝 Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
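The paper provides no code here, but the sample-perturbation half of the idea follows the familiar sharpness-aware (SAM-style) recipe: step toward the locally worst-case direction of the reward, then take the gradient there, so that sharp spikes in the reward landscape contribute less to the fine-tuning signal. A minimal sketch on a toy reward function (the reward, the finite-difference gradient, and the perturbation radius `rho` are all illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def reward(x):
    # Toy reward: a smooth base term plus a sharp spike near x = 1.
    # Climbing the spike inflates reward without a "real" quality gain,
    # mimicking reward hacking. (Illustrative stand-in, not a real reward model.)
    return -np.sum(x ** 2) + 5.0 * np.exp(-50.0 * np.sum((x - 1.0) ** 2))

def grad_reward(x, eps=1e-5):
    # Central finite-difference gradient of the toy reward.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (reward(x + d) - reward(x - d)) / (2.0 * eps)
    return g

def sharpness_aware_grad(x, rho=0.1):
    # SAM-style sample perturbation: move by rho along the normalized
    # reward gradient (the locally "sharpest" ascent direction), then
    # evaluate the gradient at the perturbed point. Where the landscape
    # is flat the result matches the plain gradient; near sharp spikes
    # it differs, damping the hacking-prone signal.
    g = grad_reward(x)
    x_pert = x + rho * g / (np.linalg.norm(g) + 1e-12)
    return grad_reward(x_pert)
```

The paper's full method also perturbs the diffusion model's parameters (not shown); the same ascend-then-differentiate pattern applies in weight space.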
Problem

Research questions and friction points this paper is trying to address.

reward hacking
diffusion models
reinforcement learning from human feedback
reward robustness
perceptual quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward hacking
diffusion models
robust optimization
gradient flattening
reinforcement learning from human feedback