🤖 AI Summary
Existing reward models for image editing rely on large-scale preference annotation and additional training, which makes it hard to capture nuanced human preferences efficiently. This work proposes RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. In its Orchestrator–Sub-Agent architecture, the Orchestrator selects relevant tools and skills from a maintained library, a frozen Sub-Agent uses them to build a reasoning chain that yields a preference judgment, and the Orchestrator then refines the library by comparing predicted judgments against ground-truth preferences. With only 100 preference samples (0.05% of the EditReward dataset) and no further human annotation, the method reaches 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. Used as the reward signal for GRPO fine-tuning, it yields RL-tuned models that score 3.52 on ImgEdit-Bench.
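For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation that GRPO is generally known for: the rewards of several samples drawn for the same prompt are normalized by the group mean and standard deviation. It is an illustration only. The function name, the `eps` constant, the choice of population standard deviation, and the reward values are assumptions, and the summary does not say how a preference judgment is converted into a scalar reward.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize per-sample rewards within one group of samples for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    implementations differ on both details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four candidate edits of the same instruction.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples scored above the group mean receive a positive advantage and those below receive a negative one, so the policy update depends on relative quality within the group and not on the absolute reward scale.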
📝 Abstract
Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
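The select, judge, and refine cycle described in the abstract can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Example` and `Library` schemas, the `select`/`judge`/`refine` callables, the `rounds` parameter, and the toy stand-ins are all hypothetical, and in the real system the Orchestrator and Sub-Agent would be model-backed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Example:
    """One preference demonstration (hypothetical schema)."""
    source_image: str
    candidates: Sequence[str]
    instruction: str
    preferred: int  # index of the human-preferred candidate


@dataclass
class Library:
    """Evolving context: named tools and free-text skills (hypothetical layout)."""
    tools: Dict[str, str] = field(default_factory=dict)
    skills: List[str] = field(default_factory=list)


# Hypothetical signatures standing in for the model-backed components.
Select = Callable[[Library, Example], Library]  # Orchestrator: pick a relevant subset
Judge = Callable[[Library, Example], int]       # frozen Sub-Agent: predicted index
Refine = Callable[[Library, List[Example], List[Example]], Library]  # Orchestrator: update


def accuracy(library: Library, data: List[Example], select: Select, judge: Judge) -> float:
    """Fraction of examples where the predicted preference matches the label."""
    correct = sum(judge(select(library, ex), ex) == ex.preferred for ex in data)
    return correct / len(data)


def evolve(library: Library, demos: List[Example], select: Select,
           judge: Judge, refine: Refine, rounds: int = 3) -> Library:
    """Evolve the library (context) only; no model weights are updated."""
    for _ in range(rounds):
        successes: List[Example] = []
        failures: List[Example] = []
        for ex in demos:
            context = select(library, ex)
            prediction = judge(context, ex)
            (successes if prediction == ex.preferred else failures).append(ex)
        if not failures:
            break  # every demonstration is already judged correctly
        library = refine(library, successes, failures)
    return library


# Toy stand-ins so the loop runs end to end; real components would call models.
def toy_select(library: Library, ex: Example) -> Library:
    return library  # use the whole library


def toy_judge(library: Library, ex: Example) -> int:
    # Picks candidate 1 only once the library holds a (made-up) skill.
    return 1 if "prefer_minimal_change" in library.skills else 0


def toy_refine(library: Library, successes: List[Example],
               failures: List[Example]) -> Library:
    return Library(tools=dict(library.tools),
                   skills=library.skills + ["prefer_minimal_change"])


demos = [Example("src.png", ["a.png", "b.png"], "make the sky blue", preferred=1)]
evolved = evolve(Library(), demos, toy_select, toy_judge, toy_refine)
```

In this toy run the first round fails, the refinement step adds one skill, and the second round succeeds, which mirrors the idea that alignment comes from changing the library while the judging agent stays frozen.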