🤖 AI Summary
As language models (LMs) generate increasingly natural outputs, assessing their quality has become substantially harder. While scaling test-time computation (e.g., longer "thinking" time) has been shown to improve generation performance on math and coding tasks, its role in *evaluation* has remained unexplored.
Method: This work presents a systematic study of scaling evaluation-time computation, using chain-of-thought (CoT) reasoning models as evaluators: the models assess both the final output (outcome evaluation) and each intermediate reasoning step (process evaluation), and the resulting judgments are used to rerank multiple candidate generations.
Contribution/Results: The central finding is that evaluator accuracy improves monotonically as the evaluator generates more reasoning tokens. Experiments on math and coding benchmarks show that reranking with these stronger evaluators significantly boosts final problem-solving success rates, matching the gains achieved by spending an equivalent amount of additional compute at generation time.
📝 Abstract
As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
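To make the reranking idea concrete, here is a minimal sketch of evaluator-based best-of-N selection. The paper's evaluator is a reasoning model scoring each step; here a toy scoring function (`toy_eval`, an assumption for illustration) stands in for that LM judge, and the minimum step score is used as one plausible way to aggregate process-evaluation scores over a chain.

```python
from typing import Callable, List

def rerank_best_of_n(
    candidates: List[List[str]],
    score_step: Callable[[str], float],
) -> List[str]:
    """Pick the candidate whose weakest reasoning step scores highest.

    Each candidate is a list of reasoning steps; `score_step` stands in
    for a reasoning-model evaluator rating one step in [0, 1]. Taking
    the minimum over steps reflects the intuition behind process
    evaluation: a chain is only as sound as its weakest step.
    """
    def chain_score(steps: List[str]) -> float:
        return min(score_step(s) for s in steps)

    return max(candidates, key=chain_score)

# Toy stand-in evaluator: longer steps score higher (hypothetical; a
# real system would prompt a reasoning model to judge each step).
toy_eval = lambda step: min(len(step) / 20.0, 1.0)

cands = [
    ["x=2", "so 2+2=4"],
    ["let x equal two", "therefore x plus two equals four"],
]
best = rerank_best_of_n(cands, toy_eval)  # selects the second candidate
```

Outcome evaluation corresponds to scoring the whole response once instead of per step; in that case `candidates` would be single-element lists and the aggregation collapses to the outcome score.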