🤖 AI Summary
This study addresses the low reliability of language model (LM)-based automatic evaluation on high-difficulty tasks—such as Olympiad-level mathematics and research-level physics—where standard LMs often lack sufficient domain knowledge or reasoning capacity. To overcome this, we propose a privileged-information-augmented evaluation paradigm: by injecting ground-truth answers and solution guidelines into the evaluator’s context, even general-purpose, relatively weak LMs can accurately assess stronger, more capable models. Methodologically, we integrate privileged-information-guided prompt engineering with problem simplification strategies to construct a lightweight, task-agnostic automated scoring framework. Our key contribution is challenging the implicit assumption that evaluators must outperform the models they evaluate—thereby substantially expanding the capability frontier of LM-based evaluation and improving the separability of different models. Experiments demonstrate state-of-the-art performance on RewardBench, superior agreement with human judgments compared to individual human raters on Vibe-Eval, and near-expert-level inter-rater consistency on Olympiad mathematics tasks.
📝 Abstract
Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate a candidate LM, is an appealing way to accelerate evaluation and reduce its cost. But this presents a paradox: how can we trust a grader LM, which is presumably weaker than the candidate LM, to assess problems beyond the capability frontier of either model? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply; in particular, weaker models can now rate the predictions of stronger models. Second, privileged information can be used to devise easier variations of challenging problems, which improves the separability of different LMs on tasks where their performance is generally low. With this approach, general-purpose LM graders match state-of-the-art performance on RewardBench, surpassing almost all specially tuned models. LM graders also outperform individual human raters on Vibe-Eval, and approach human expert graders on Olympiad-level math problems.
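To make the idea concrete, here is a minimal sketch of how privileged information might be injected into a grader's prompt. The template, the 1–5 grade scale, and the function name are illustrative assumptions, not the paper's exact protocol; in practice the assembled prompt would be sent to any LM API.

```python
def build_grader_prompt(problem, candidate_answer, reference_solution, guidelines):
    """Assemble a grading prompt that gives the (possibly weaker) grader LM
    privileged information: a ground-truth solution and problem-specific
    grading guidelines that the candidate model never saw."""
    return (
        "You are grading a model's answer to a hard problem.\n\n"
        f"Problem:\n{problem}\n\n"
        "Privileged information (not shown to the candidate model):\n"
        f"Reference solution:\n{reference_solution}\n"
        f"Grading guidelines:\n{guidelines}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        "Compare the candidate answer against the reference solution and the "
        "guidelines, then reply with a single grade from 1 (wrong) to 5 (correct)."
    )

# Hypothetical usage: the grader need not be able to solve the integral
# itself -- it only has to check the candidate against the reference.
prompt = build_grader_prompt(
    problem="Evaluate the indefinite integral of x * e^x dx.",
    candidate_answer="x*e^x - e^x + C",
    reference_solution="Integration by parts gives (x - 1) e^x + C.",
    guidelines="Accept any algebraically equivalent antiderivative; "
               "the constant C may be omitted.",
)
```

The point of the sketch is that grading reduces to a comparison task, which is typically far easier than solving the problem from scratch; this is why a weaker general-purpose LM can reliably rate a stronger model's output.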