🤖 AI Summary
Reinforcement learning (RL) post-training of existing visual generative models is hindered by reward modeling, which typically relies on large-scale human preference annotations or on hand-crafted quality metrics that are incomplete and costly. Method: We propose GAN-RM, a novel reward modeling paradigm that uses unpaired target samples as implicit preference proxies, eliminating both human annotation and explicit quality-metric design. Its core is a binary-classification reward model inspired by GAN-style adversarial discrimination, trained efficiently from only a few hundred representative target samples. Contribution/Results: GAN-RM integrates seamlessly with mainstream RL post-training pipelines, including Best-of-N filtering, supervised fine-tuning (SFT), and direct preference optimization (DPO), and demonstrates substantial improvements in practicality, scalability, and engineering efficiency across diverse vision generation tasks.
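The core idea above, a binary classifier that separates a small pool of target samples from ordinary model outputs and whose predicted probability serves as the reward, can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the logistic-regression discriminator, feature dimensions, and function names (`train_gan_rm`, `reward`) are all assumptions standing in for the real discriminator network operating on image features.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_gan_rm(target_feats, generated_feats, lr=0.1, steps=500):
    """Fit a logistic-regression discriminator on target (label 1)
    vs. generated (label 0) samples; returns weights and bias."""
    X = np.vstack([target_feats, generated_feats])
    y = np.concatenate([np.ones(len(target_feats)),
                        np.zeros(len(generated_feats))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid
        grad = p - y                            # dBCE/dlogit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def reward(feats, w, b):
    """Reward = predicted probability of being a target-quality sample."""
    return 1.0 / (1.0 + np.exp(-(feats @ w + b)))

# Toy "Preference Proxy Data": a few hundred target samples clustered
# at +1, with ordinary generated samples clustered at -1.
target = rng.normal(1.0, 0.5, size=(200, 8))
generated = rng.normal(-1.0, 0.5, size=(200, 8))
w, b = train_gan_rm(target, generated)
```

Samples that look more like the target distribution then receive higher reward, which is the signal the downstream RL pipelines consume.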
📝 Abstract
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling are complex to implement, relying either on extensive human-annotated preference data or on meticulously engineered quality dimensions that are often incomplete. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality-dimension engineering. Our method trains the reward model to discriminate between a small set of representative, unpaired target samples (denoted Preference Proxy Data) and the model's ordinary generated outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, as well as post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
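Of the applications listed, Best-of-N sample filtering is the simplest to picture: generate N candidates and keep the one the reward model scores highest. A minimal self-contained sketch, where `reward_fn` stands in for the trained GAN-RM discriminator and the scalar "candidates" stand in for generated images, both illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def best_of_n(generate, reward_fn, n=8):
    """Test-time scaling: draw n candidates from the generator and
    return the one with the highest reward-model score."""
    candidates = [generate() for _ in range(n)]
    scores = [reward_fn(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy setup: candidates are scalars; the stand-in reward prefers
# values near 2.0 (as GAN-RM prefers target-like samples).
generate = lambda: rng.normal(0.0, 1.0)
reward_fn = lambda x: -abs(x - 2.0)

best = best_of_n(generate, reward_fn, n=32)
print(best)  # the sampled value closest to 2.0
```

The same scoring function can also rank pairs of outputs to build preference data for SFT or DPO, which is how a single GAN-RM plugs into all three pipelines.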