🤖 AI Summary
To address weak generalization, high task-specific adaptation costs, and reliance on reinforcement learning (RL) in automated evaluation of reasoning models, this paper proposes a multi-task, general-purpose evaluation paradigm. The authors curate a large-scale, cross-domain reasoning-evaluation dataset of 2.5 million samples and train FARE, a family of generative evaluators at 8B and 20B parameters (the 20B model activating only 3.6B parameters per token). Using multi-task supervised finetuning with iterative rejection sampling, a single FARE model supports five evaluation tasks—pairwise comparison, step-level verification, reference-free and reference-based verification, and single-score rating—without task-specific architectural modifications. Experiments show that FARE-20B surpasses specialized 70B+ open-source evaluators on static benchmarks and, as an inference-time reranker, achieves near-oracle performance on MATH; as a verifier in RL training, FARE improves downstream model performance by up to 14.1% over string-matching verifiers; and a continually finetuned variant, FARE-Code, outperforms gpt-oss-20B by 65% on evaluating test-case quality.
📝 Abstract
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
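The iterative rejection-sampling SFT loop described in the abstract can be sketched at a high level. This is a minimal toy illustration, not the paper's implementation: `propose_judgment`, `rejection_sample`, and `iterative_rsft` are hypothetical names, the model is stubbed with random sampling, and the actual finetuning step is omitted.

```python
import random

def propose_judgment(example, rng=random):
    # Stub for sampling a judgment (e.g. "A" or "B" in a pairwise task)
    # from the current evaluator; a real system would decode from an LLM.
    return rng.choice(["A", "B"])

def rejection_sample(dataset, k=8, rng=random):
    """Sample up to k judgments per example; keep one that matches the gold label."""
    kept = []
    for example in dataset:
        for _ in range(k):
            verdict = propose_judgment(example, rng=rng)
            if verdict == example["gold"]:
                kept.append({"prompt": example["prompt"], "target": verdict})
                break  # one accepted sample per example in this sketch
    return kept

def iterative_rsft(dataset, rounds=2, rng=random):
    """Alternate rejection sampling and (stubbed) SFT for a few rounds."""
    sft_data = []
    for _ in range(rounds):
        sft_data = rejection_sample(dataset, rng=rng)
        # train_sft(model, sft_data)  # real finetuning step omitted
    return sft_data
```

The key property is that only generations whose final verdict agrees with the gold label survive into the next round's SFT data, so each round trains on the model's own correct reasoning traces.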