A Rubric-Supervised Critic from Sparse Real-World Outcomes

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
In real-world settings, success signals for agents are often sparse, delayed, and noisy, posing significant challenges for effective training and evaluation. This work proposes a semi-supervised framework based on Critic Rubrics that automatically extracts 24 behavioral features from human–agent interaction trajectories and leverages minimal human feedback to train a “critic” model for either reward modeling in reinforcement learning or trajectory re-ranking at inference time. Notably, this approach achieves effective supervision using only sparse ground-truth outcomes, substantially narrowing the gap between academic benchmarks and real-world scenarios. On SWE-bench, it improves Best@8 by 15.9 points over Random@8, enables early stopping—reducing the number of attempts by 83% while gaining 17.7 points—and facilitates high-quality training data selection.
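The two inference-time uses described above (trajectory re-ranking and early stopping) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `critic_score` stands in for the learned critic, and the threshold-based stopping rule is an assumption about how "83% fewer attempts" could be realized.

```python
# Hypothetical sketch: using a critic score for best-of-N reranking and
# early stopping. `critic_score` is a stand-in for the trained critic model.

from typing import Callable, List, Tuple


def rerank_best_of_n(
    trajectories: List[str],
    critic_score: Callable[[str], float],
) -> str:
    """Return the trajectory the critic scores highest (Best@N selection)."""
    return max(trajectories, key=critic_score)


def early_stop(
    sample_next: Callable[[], str],
    critic_score: Callable[[str], float],
    threshold: float,
    max_attempts: int = 8,
) -> Tuple[str, int]:
    """Sample trajectories until one clears the critic threshold.

    Returns the best trajectory seen and the number of attempts used;
    stopping early is what saves sampling budget.
    """
    best, best_score = "", float("-inf")
    for attempt in range(1, max_attempts + 1):
        traj = sample_next()
        score = critic_score(traj)
        if score > best_score:
            best, best_score = traj, score
        if score >= threshold:
            return best, attempt  # early exit: no further attempts needed
    return best, max_attempts
```

With a toy scorer (string length), `rerank_best_of_n(["a", "abc", "ab"], len)` picks `"abc"`; the same scorer drives `early_stop` to halt as soon as a sample reaches the threshold.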

📝 Abstract
Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process to learn a "critic" model from sparse and noisy interaction data, which can then be used as a reward model either for RL-based training or for inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we can then jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily from trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
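The semi-supervised objective in the abstract (jointly predicting the 24 trace-derived rubrics everywhere, plus sparse human feedback only where it exists) can be sketched as below. All concrete details are assumptions for illustration: the linear heads, the feature matrix `X`, and the use of a boolean mask to restrict the outcome loss to labeled trajectories are not taken from the paper.

```python
# Hypothetical sketch of a semi-supervised critic objective: rubric loss on
# every trajectory, outcome loss only where sparse human feedback is present.
import numpy as np

NUM_RUBRICS = 24  # behavioral features derived from interaction traces


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(logits, targets):
    """Binary cross-entropy over logits, averaged over all entries."""
    p = sigmoid(logits)
    eps = 1e-9  # numerical floor to keep log() finite
    return -np.mean(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))


def semi_supervised_loss(W_r, W_o, X, rubrics, outcome, outcome_mask):
    """Joint objective: rubrics are supervised on all rows of X; the sparse
    outcome label contributes only on rows where outcome_mask is True."""
    rubric_logits = X @ W_r    # shape (n, NUM_RUBRICS)
    outcome_logits = X @ W_o   # shape (n,)
    loss = bce(rubric_logits, rubrics)
    if outcome_mask.any():
        loss += bce(outcome_logits[outcome_mask], outcome[outcome_mask])
    return loss
```

When no feedback is present (`outcome_mask` all False), the objective reduces to rubric prediction alone, which is what lets the critic train from interaction traces even where ground-truth outcomes are missing.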
Problem

Research questions and friction points this paper is trying to address.

sparse feedback
real-world outcomes
coding agents
reward modeling
human-in-the-loop
Innovation

Methods, ideas, or system contributions that make the work stand out.

Critic Rubrics
sparse feedback
semi-supervised learning
reward modeling
human-in-the-loop