🤖 AI Summary
To address weak generalization, high task-specific adaptation costs, and reliance on reinforcement learning (RL) in automated evaluation of reasoning models, this paper proposes a multi-task, general-purpose evaluation paradigm. The authors curate a large-scale, cross-domain reasoning-evaluation dataset of 2.5 million samples and train FARE, a family of generative evaluators at 8B and 20B parameters (the 20B model activating only 3.6B parameters per token). Using multi-task supervised finetuning with iterative rejection sampling, a single FARE model supports five evaluation tasks—pairwise comparison, step-level verification, reference-free and reference-based verification, and single-score rating—without task-specific architectural modifications. Experiments show that FARE-20B surpasses specialized 70B+ open-source evaluators on static benchmarks and, as an inference-time reranker, achieves near-oracle performance on MATH; as a verifier in RL training, FARE improves downstream model performance by up to 14.1% over string-matching verifiers; and a continually finetuned variant, FARE-Code, outperforms gpt-oss-20B by 65% on evaluating test-case quality.
📝 Abstract
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
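The iterative rejection-sampling SFT loop described in the abstract can be sketched at a high level. This is a minimal toy illustration, not the paper's implementation: `propose_judgment`, `rejection_sample`, and `iterative_rsft` are hypothetical names, the model is stubbed with random sampling, and the actual finetuning step is omitted.

```python
import random

def propose_judgment(example, rng=random):
    # Stub for sampling a judgment (e.g. "A" or "B" in a pairwise task)
    # from the current evaluator; a real system would decode from an LLM.
    return rng.choice(["A", "B"])

def rejection_sample(dataset, k=8, rng=random):
    """Sample up to k judgments per example; keep one that matches the gold label."""
    kept = []
    for example in dataset:
        for _ in range(k):
            verdict = propose_judgment(example, rng=rng)
            if verdict == example["gold"]:
                kept.append({"prompt": example["prompt"], "target": verdict})
                break  # one accepted sample per example in this sketch
    return kept

def iterative_rsft(dataset, rounds=2, rng=random):
    """Alternate rejection sampling and (stubbed) SFT for a few rounds."""
    sft_data = []
    for _ in range(rounds):
        sft_data = rejection_sample(dataset, rng=rng)
        # train_sft(model, sft_data)  # real finetuning step omitted
    return sft_data
```

The key property is that only generations whose final verdict agrees with the gold label survive into the next round's SFT data, so each round trains on the model's own correct reasoning traces.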