Fake it till You Make it: Reward Modeling as Discriminative Prediction

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) post-training of vision generative models is hindered by reward modeling, which typically relies on large-scale human preference annotations or hand-crafted quality metrics that are incomplete and costly. Method: We propose GAN-RM, a novel reward modeling paradigm that uses unpaired target samples as implicit preference proxies, eliminating both human annotation and explicit quality metric design. Its core is a binary-classification reward model inspired by GAN-style adversarial discrimination, trained efficiently from only a few hundred representative target samples. Contribution/Results: GAN-RM integrates seamlessly with mainstream RL post-training pipelines, including Best-of-N filtering, supervised fine-tuning (SFT), and direct preference optimization (DPO), and demonstrates substantial improvements in practicality, scalability, and engineering efficiency across diverse vision generation tasks.
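The paper does not give code here, but the core idea, a binary classifier that separates a small pool of target samples from the model's ordinary outputs and whose "looks like target" probability serves as the reward, can be sketched with a toy logistic-regression discriminator on synthetic feature vectors. Everything below (the feature space, the training loop, the `reward` helper) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "target" samples cluster around +1, "ordinary"
# model-generated samples around -1, in a 4-d feature space.
target = rng.normal(loc=1.0, scale=0.5, size=(200, 4))     # label 1
ordinary = rng.normal(loc=-1.0, scale=0.5, size=(200, 4))  # label 0

X = np.vstack([target, ordinary])
y = np.concatenate([np.ones(200), np.zeros(200)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Logistic-regression discriminator trained with plain gradient descent;
# its probability-of-target output doubles as the reward signal.
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

def reward(x):
    """Reward of a generated sample = discriminator's 'is it target-like' score."""
    return float(sigmoid(x @ w + b))

print(reward(np.full(4, 1.0)))   # target-like sample: high reward
print(reward(np.full(4, -1.0)))  # ordinary sample: low reward
```

In the paper's setting the classifier operates on model outputs (images) rather than toy vectors, and the few hundred target samples play the role of the `target` array above.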

📝 Abstract
An effective reward model plays a pivotal role in reinforcement learning for post-training enhancement of visual generative models. However, current approaches to reward modeling suffer from implementation complexity due to their reliance on extensive human-annotated preference data or meticulously engineered quality dimensions that are often incomplete and engineering-intensive. Inspired by adversarial training in generative adversarial networks (GANs), this paper proposes GAN-RM, an efficient reward modeling framework that eliminates manual preference annotation and explicit quality dimension engineering. Our method trains the reward model through discrimination between a small set of representative, unpaired target samples (denoted as Preference Proxy Data) and model-generated ordinary outputs, requiring only a few hundred target samples. Comprehensive experiments demonstrate GAN-RM's effectiveness across multiple key applications, including test-time scaling implemented as Best-of-N sample filtering, and post-training approaches such as Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
Problem

Research questions and friction points this paper is trying to address.

Simplifying reward model implementation in reinforcement learning
Reducing reliance on human-annotated preference data
Eliminating manual quality dimension engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses GAN-inspired adversarial training for reward modeling
Eliminates need for human-annotated preference data
Requires only a few hundred unpaired target samples
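Of the downstream uses listed above, Best-of-N filtering is the simplest to picture: generate N candidates, score each with the reward model, and keep the top-scoring one. A minimal sketch (the function name and toy reward are assumptions for illustration):

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], reward_fn: Callable[[T], float]) -> T:
    """Best-of-N filtering: keep the candidate the reward model scores highest."""
    return max(candidates, key=reward_fn)

# Toy usage: stand-in reward prefers values close to 0.5.
samples = [0.1, 0.48, 0.9, 0.55]
best = best_of_n(samples, reward_fn=lambda x: -abs(x - 0.5))
print(best)  # 0.48
```

In GAN-RM's setting, `reward_fn` would be the trained discriminator's probability that a generated sample belongs to the target distribution.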