RewardHarness: Self-Evolving Agentic Post-Training

📅 2026-05-09

🤖 AI Summary

Existing reward models rely heavily on large-scale preference annotation and additional training, which makes it hard to capture nuanced human preferences about image-editing outcomes efficiently. This work proposes a self-evolving, agent-based reward framework that reframes reward modeling as context evolution rather than weight optimization. In its Orchestrator–Sub-Agent architecture, an Orchestrator selects tools and skills from a maintained library, a frozen Sub-Agent uses them to build a reasoning chain that yields a preference judgment, and the library is then updated according to whether those judgments match the ground-truth preferences. With only 100 preference samples (0.05% of the EditReward dataset) and no further human annotation, the method achieves 47.4% average accuracy on image-editing evaluation benchmarks, outperforming GPT-5 by 5.3 points. When used as the reward signal for GRPO, the resulting RL-fine-tuned model scores 3.52 on ImgEdit-Bench.
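To make the GRPO step concrete, here is a minimal sketch of how per-sample rewards from a judge could be turned into group-relative advantages. The normalization (subtract the group mean, divide by the group standard deviation) is the usual GRPO formulation. The function name, the use of the population standard deviation, the `eps` term, and the example rewards are illustrative assumptions and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Assumption: population standard deviation plus a small eps for stability;
    concrete GRPO implementations may differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four edited images sampled for one instruction,
# e.g. 1.0 when the reward framework prefers the sample and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Samples scored above the group mean receive positive advantages and the rest negative ones, so only the relative ranking within a group drives the policy update. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.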

📝 Abstract

Evaluating instruction-guided image edits requires rewards that reflect subtle human preferences, yet current reward models typically depend on large-scale preference annotation and additional model training. This creates a data-efficiency gap: humans can often infer the target evaluation criteria from only a few examples, while models are usually trained on hundreds of thousands of comparisons. We present RewardHarness, a self-evolving agentic reward framework that reframes reward modeling as context evolution rather than weight optimization. Instead of learning from large-scale annotations, RewardHarness aligns with human preferences by iteratively evolving a library of tools and skills from as few as 100 preference demonstrations. Given a source image, candidate edited images, and an editing instruction, an Orchestrator selects the most relevant subset of tools and skills from the maintained library, and a frozen Sub-Agent uses them to construct a reasoning chain that produces a preference judgment. By comparing predicted judgments with ground-truth preferences and analyzing successes and failures in the reasoning process, the Orchestrator automatically refines its library of tools and skills without additional human annotation. Using only 0.05% of the EditReward preference data, RewardHarness achieves 47.4% average accuracy on image-editing evaluation benchmarks, surpassing GPT-5 by 5.3 points. When used as a reward signal for GRPO fine-tuning, RL-tuned models achieve 3.52 on ImgEdit-Bench. Project page: https://rewardharness.com.
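The select, judge, compare, refine loop described above can be sketched as a toy program. Everything in it is a hypothetical stand-in: candidate images are replaced by dictionaries of per-criterion scores, the frozen Sub-Agent is a fixed scoring rule, and refinement is a greedy search that adds whichever pooled skill most improves agreement with the ground truth. That greedy search substitutes for the paper's analysis of reasoning successes and failures, which the abstract does not specify in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    instruction: str
    candidates: tuple  # two dicts of toy per-criterion scores standing in for images
    preferred: int     # ground-truth index of the human-preferred edit


def sub_agent_judge(comparison, skills):
    """Frozen judge: its rule never changes, only the skills it is handed."""
    totals = [sum(c.get(s, 0.0) for s in skills) for c in comparison.candidates]
    return 0 if totals[0] >= totals[1] else 1


def select_skills(library, comparison):
    """Orchestrator selection step (toy: returns the whole library).

    A real orchestrator would pick a relevant subset per comparison.
    """
    return list(library)


def accuracy(library, data):
    hits = sum(sub_agent_judge(c, select_skills(library, c)) == c.preferred for c in data)
    return hits / len(data)


def evolve(library, data, skill_pool, rounds=3):
    """Greedy stand-in for refinement: grow the library while accuracy improves."""
    library = list(library)
    for _ in range(rounds):
        base = accuracy(library, data)
        gains = {s: accuracy(library + [s], data) - base
                 for s in skill_pool if s not in library}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        library.append(best)
    return library


# Hypothetical preference demonstrations and candidate skills.
data = [
    Comparison("remove the car",
               ({"instruction_following": 0.9, "background_preservation": 0.8},
                {"instruction_following": 0.2, "background_preservation": 0.9}), 0),
    Comparison("make the sky pink",
               ({"instruction_following": 0.7, "background_preservation": 0.3},
                {"instruction_following": 0.7, "background_preservation": 0.9}), 1),
    Comparison("add a hat",
               ({"instruction_following": 0.1, "background_preservation": 0.9},
                {"instruction_following": 0.8, "background_preservation": 0.8}), 1),
]
skill_pool = ["instruction_following", "background_preservation", "aesthetics"]
library = evolve([], data, skill_pool)
```

The point of the sketch is that the judge's weights never change: agreement with the labeled preferences improves only because the context handed to the judge (the skill library) grows. In this toy data, the library ends up containing the two criteria that together explain all three preferences, and the unused "aesthetics" skill is never added.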

Problem

Research questions and friction points this paper is trying to address.

reward modeling

image editing

preference learning

data efficiency

human alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolving agentic reward

context evolution

tool and skill library