🤖 AI Summary
This work addresses the test-time compute optimality (TCO) problem: maximizing the answer accuracy of large language models under a fixed inference budget by decoupling the external generation and selection phases. We propose Compute-Aware Tree Search (CATS), an actor-critic–style framework that jointly performs multi-path generation, tree search, and reinforcement learning. Crucially, CATS is the first to establish a PAC-Bayes analysis linking the generalization error of the process reward model (PRM) to sample complexity, enabling dynamic budget allocation guided by reward distribution statistics and sparsity. The method balances accuracy and efficiency without increasing model parameters or training overhead. On the MATH and AIME benchmarks, CATS significantly outperforms existing external test-time search methods, achieving substantial accuracy gains under identical compute budgets. These results empirically validate that controlling PRM generalization error is critical for effective TCO.
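The summary does not reproduce the paper's exact bound; for orientation, a standard PAC-Bayes generalization bound of the kind such analyses typically build on (the McAllester/Maurer form) reads:

```latex
% For a prior $P$ over PRM hypotheses, any posterior $Q$, and an i.i.d.
% sample $S$ of size $n$, with probability at least $1-\delta$ over $S$:
\mathbb{E}_{h\sim Q}\!\left[L(h)\right]
\;\le\;
\mathbb{E}_{h\sim Q}\!\left[\hat{L}_S(h)\right]
+ \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

Intuitively, the tighter this gap between true loss $L$ and empirical loss $\hat{L}_S$, the more reliably the PRM's scores rank candidate reasoning paths, and the fewer samples are needed to surface a correct answer.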
📝 Abstract
External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. At inference time, the model generates multiple reasoning paths, and an auxiliary process reward model (PRM) scores them and selects the best one. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. In this work, we establish a theoretical framework to analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Leveraging PAC-Bayes theory, we derive generalization bounds and show that a lower PRM generalization error reduces the number of samples required to find correct answers. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior. The actor outputs sampling hyperparameters based on reward distributions and sparsity statistics, while the critic estimates their utility to guide budget allocation. Experiments on the MATH and AIME benchmarks with various LLMs and PRMs demonstrate that CATS consistently outperforms other external test-time search (TTS) methods, validating our theoretical predictions.
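To make the actor-critic control loop concrete, here is a minimal sketch of compute-aware budget allocation. All names, thresholds, and the utility heuristic are illustrative assumptions, not the paper's actual policy: the actor maps reward-distribution statistics to sampling hyperparameters, the critic estimates a utility per unit of compute, and a greedy loop funds branches within the budget.

```python
import statistics

def actor(reward_stats):
    """Hypothetical actor: map (mean, spread, sparsity) of a branch's
    PRM rewards to sampling hyperparameters. Thresholds are illustrative."""
    mean, spread, sparsity = reward_stats
    # Confident, low-variance branches get fewer samples; uncertain ones more.
    n_samples = 4 if mean > 0.8 and spread < 0.1 else 16
    # Sparser reward signals encourage more exploratory sampling.
    temperature = 0.7 + 0.3 * sparsity
    return {"n_samples": n_samples, "temperature": temperature}

def critic(action, reward_stats):
    """Hypothetical critic: utility as estimated accuracy headroom times
    reward dispersion, normalized by the compute the action requests."""
    mean, spread, _ = reward_stats
    return (1.0 - mean) * spread / action["n_samples"]

def allocate(paths_rewards, budget):
    """One allocation step: score each candidate branch, then greedily
    fund the highest-utility actions within the compute budget."""
    plans = []
    for rewards in paths_rewards:
        stats = (
            statistics.mean(rewards),
            statistics.pstdev(rewards),
            sum(r == 0 for r in rewards) / len(rewards),  # reward sparsity
        )
        action = actor(stats)
        plans.append((critic(action, stats), action))
    plans.sort(key=lambda p: p[0], reverse=True)
    funded = []
    for _, action in plans:
        if budget >= action["n_samples"]:
            budget -= action["n_samples"]
            funded.append(action)
    return funded
```

A confident branch (high mean reward, low spread) is cheap to confirm, so it receives few samples; an uncertain branch offers more expected gain and is funded first if the budget allows.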