Enhancing Large Language Model Reasoning with Reward Models: An Analytical Survey

📅 2025-10-02
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Reward models (RMs) play a dual role in enhancing large language model (LLM) reasoning—serving both as training signals for reinforcement learning fine-tuning and as selection mechanisms over multiple candidate answers during inference—yet their systematic evaluation, limitations, and robustness remain underexplored. Method: We propose a unified evaluation framework covering RM architecture design, generalization analysis, synthetic data generation, and iterative self-improvement mechanisms. We empirically diagnose key bottlenecks in generation guidance, preference modeling, and policy alignment. Contribution/Results: We introduce principled RM selection criteria and robustness-enhancing strategies, achieving significant improvements in accuracy and decision consistency on complex reasoning tasks—including mathematical and symbolic reasoning. Our work establishes a reproducible methodology and practical guidelines for RM-driven trustworthy reasoning, advancing both theoretical understanding and empirical deployment of RMs in LLM reasoning pipelines.
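The selection role described above is typically realized as best-of-N reranking: sample several candidate answers, score each with the RM, and keep the highest-scoring one. A minimal sketch in Python, where `generate_candidates` and `score` are hypothetical stand-ins for your own sampling and RM-scoring calls:

```python
# Minimal best-of-N sketch. `generate_candidates` and `score` are
# hypothetical stand-ins for your own sampling and RM-scoring code.
from typing import Callable, List

def best_of_n(prompt: str,
              generate_candidates: Callable[[str, int], List[str]],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the one the RM scores highest."""
    candidates = generate_candidates(prompt, n)        # e.g., temperature sampling
    rewards = [score(prompt, c) for c in candidates]   # one scalar RM score per candidate
    best = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best]
```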

📝 Abstract
Reward models (RMs) play a critical role in enhancing the reasoning performance of LLMs. For example, they can provide training signals to finetune LLMs during reinforcement learning (RL) and help select the best answer from multiple candidates during inference. In this paper, we provide a systematic introduction to RMs, along with a comprehensive survey of their applications in LLM reasoning. We first review fundamental concepts of RMs, including their architectures, training methodologies, and evaluation techniques. Then, we explore their key applications: (1) guiding generation and selecting optimal outputs during LLM inference, (2) facilitating data synthesis and iterative self-improvement for LLMs, and (3) providing training signals in RL-based finetuning. Finally, we address critical open questions regarding the selection, generalization, evaluation, and enhancement of RMs, based on existing research and our own empirical findings. Our analysis aims to provide actionable insights for the effective deployment and advancement of RMs for LLM reasoning.
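As context for the training methodologies the survey reviews, the most common recipe for outcome RMs is the Bradley-Terry pairwise objective: push the RM's scalar score for the preferred answer above the rejected one. A minimal PyTorch sketch, assuming `rm` is any module that returns one scalar score per input sequence (the name and interface are illustrative, not the paper's):

```python
# Bradley-Terry pairwise loss for an outcome RM: -log sigmoid(r_chosen - r_rejected).
# `rm` is an assumed module returning one scalar score per input sequence.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rm: torch.nn.Module,
                     chosen_ids: torch.Tensor,    # (batch, seq_len) preferred answers
                     rejected_ids: torch.Tensor   # (batch, seq_len) dispreferred answers
                     ) -> torch.Tensor:
    r_chosen = rm(chosen_ids)      # (batch,) scores for preferred answers
    r_rejected = rm(rejected_ids)  # (batch,) scores for rejected answers
    return -F.logsigmoid(r_chosen - r_rejected).mean()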
Problem

Research questions and friction points this paper is trying to address.

Surveying reward models' role in enhancing LLM reasoning capabilities
Exploring how reward models guide output selection during LLM inference
Addressing open questions about reward model selection and generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reward models guide LLM generation and output selection
Reward models enable data synthesis and iterative self-improvement
Reward models provide training signals for RL-based fine-tuning (see the sketch below)
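A minimal sketch of the third application, using a REINFORCE-style update for brevity (production pipelines typically use PPO or GRPO with a KL penalty to a reference policy); `logprob_sums` and `rm_scores` are hypothetical tensors holding the policy's sequence log-probabilities and the RM's scalar rewards:

```python
# REINFORCE-style policy loss with RM scores as rewards. A batch-mean
# baseline reduces variance; both input tensors are illustrative.
import torch

def rl_policy_loss(logprob_sums: torch.Tensor,  # (batch,) log pi(answer | prompt)
                   rm_scores: torch.Tensor      # (batch,) scalar RM rewards
                   ) -> torch.Tensor:
    advantages = rm_scores - rm_scores.mean()   # center rewards with a simple baseline
    return -(advantages.detach() * logprob_sums).mean()
```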
Authors

Qiyuan Liu (Department of Data Science and Hong Kong Institute of AI for Science, City University of Hong Kong)
Hao Xu (Li Auto Inc., China)
Xuhong Chen (Li Auto Inc., China)
Wei Chen (Li Auto Inc., China)
Yee Whye Teh (Professor of Statistical Machine Learning, Oxford; Research Scientist, DeepMind)
Ning Miao (Department of Data Science and Hong Kong Institute of AI for Science, City University of Hong Kong)