Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak generalization, high task-specific adaptation costs, and reliance on reinforcement learning (RL) in the automated evaluation of reasoning models, this paper proposes a multi-task universal evaluation paradigm. The authors curate a large-scale, cross-domain reasoning-evaluation dataset of 2.5 million samples and train FARE (Foundational Automatic Reasoning Evaluators), a family of generative evaluators at 8B and 20B parameters, the 20B model using a sparse-activation architecture with 3.6B active parameters. Through multi-task supervised fine-tuning with iterative rejection sampling, a single FARE model uniformly supports diverse evaluation tasks (pairwise comparison, step-level verification, reference-free and reference-based verification, and single-score rating) without task-specific architectural modifications. Experiments show that FARE-20B surpasses specialized 70B+ open-source evaluators on static benchmarks and, as an inference-time reranker, achieves near-oracle performance on MATH; as a verifier in RL training, FARE improves downstream model performance by up to 14.1% over string-matching verifiers; and a continually fine-tuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.

📝 Abstract
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
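The abstract's training recipe, iterative rejection-sampling SFT, can be illustrated with a minimal sketch. This is not the paper's implementation; `evaluator`, `generate_judgments`, and `rejection_sample_round` are hypothetical names, and the evaluator is reduced to a plain callable so the accept/reject logic stands out:

```python
def generate_judgments(evaluator, prompt, n=4):
    """Sample n candidate judgments from the current evaluator.
    `evaluator` is a stand-in callable here (hypothetical API)."""
    return [evaluator(prompt) for _ in range(n)]

def rejection_sample_round(evaluator, dataset, n=4):
    """One round of rejection sampling: keep only sampled judgments
    whose verdict matches the gold label, and return them as SFT
    pairs for the next finetuning iteration."""
    accepted = []
    for prompt, gold_verdict in dataset:
        for judgment in generate_judgments(evaluator, prompt, n):
            if judgment == gold_verdict:  # reject wrong-verdict samples
                accepted.append((prompt, judgment))
    return accepted
```

Iterating this loop (sample, filter against gold verdicts, finetune on the survivors, repeat) is what lets a simple SFT pipeline absorb large-scale evaluation data without RL.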
Problem

Research questions and friction points this paper is trying to address.

Weak generalization and high adaptation costs of task-specific generative evaluators
Over-reliance on new training methodology (e.g., RL) instead of large-scale, data-driven development
Lack of a single foundational evaluator covering diverse reasoning-evaluation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curating 2.5M multi-task, multi-domain samples for evaluator training
Training 8B and 20B (3.6B active) evaluators with a simple iterative rejection-sampling SFT recipe
Demonstrating state-of-the-art results as inference-time rerankers and RL verifiers
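The reranker use case above amounts to best-of-N selection: sample N candidate solutions, score each with the evaluator, and keep the top one. A minimal sketch, where `score` stands in for a single-rating evaluator's output (a hypothetical interface, not the paper's API):

```python
def rerank_best_of_n(candidates, score):
    """Return the candidate the evaluator scores highest.
    `score` maps a candidate solution to a quality value,
    standing in for a single-rating evaluator (hypothetical)."""
    return max(candidates, key=score)
```

With a strong evaluator, best-of-N selection approaches the oracle that always picks a correct candidate when one exists, which is the sense in which FARE-20B is "near-oracle" on MATH.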