RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of conventional reward models for visual generation: they reduce human preferences to a single scalar score and discard the underlying reasoning, which constrains optimization at both training and test time. The authors propose Preference-Anchored Rationalization (PARROT), a framework for training reward models that produce structured, multi-dimensional critiques before scoring. PARROT recovers interpretable rationales from existing preference data through preference-anchored generation, consistency filtering, and distillation, without requiring additional rationale annotations. At training time, the critiques provide fine-grained rewards for reinforcement learning; at test time, they drive a generate-critique-refine loop that revises prompts without any parameter updates. The resulting RationalRewards (8B) achieves state-of-the-art preference prediction among open-source reward models and is competitive with Gemini-2.5-Pro while using 10–20× less training data than comparable baselines. Its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks.

📝 Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
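The test-time Generate-Critique-Refine loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `Critique`, `generate_critique_refine`, and the `generate`, `critique`, and `refine` callables are hypothetical stand-ins for a generator, a reasoning reward model, and a prompt rewriter. Averaging per-dimension scores, the `target` threshold, and the `max_rounds` budget are likewise assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Critique:
    """Multi-dimensional critique (hypothetical structure)."""
    scores: Dict[str, float]  # per-dimension scores; assumed non-empty
    rationale: str            # free-text explanation of the scores

    @property
    def overall(self) -> float:
        # Assumption: a plain mean of the per-dimension scores.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], object],
    critique: Callable[[str, object], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, object, Critique]:
    """Return the best (prompt, image, critique) found; no weights change."""
    current = prompt
    best: Optional[Tuple[str, object, Critique]] = None
    for _ in range(max_rounds + 1):
        image = generate(current)
        # Score against the ORIGINAL prompt so that prompt revisions
        # cannot drift away from the user's intent.
        crit = critique(prompt, image)
        if best is None or crit.overall > best[2].overall:
            best = (current, image, crit)
        if crit.overall >= target:
            break
        # Turn the critique into a targeted prompt revision.
        current = refine(current, crit)
    assert best is not None  # the loop body runs at least once
    return best
```

Only the prompt changes between rounds, which matches the abstract's claim that outputs improve without any parameter updates. Keeping the best-scoring candidate rather than the last one is a design choice of this sketch that guards against a revision making the output worse.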

Problem

Research questions and friction points this paper is trying to address.

reward models

visual generation

reasoning

preference judgment

prompt refinement

Innovation

Methods, ideas, or system contributions that make the work stand out.

RationalRewards

structured rationales

Preference-Anchored Rationalization