AI Summary
To address the limited use of pretrained LLMs' text-generation abilities in existing verifiers, this paper proposes generative reward modeling (GenRM), reframing verification as an autoregressive next-token prediction task, thereby unifying verification and problem solving within the LLM's native generative paradigm. Methodologically, GenRM trains an end-to-end generative verifier from a pretrained LLM, combining supervision from synthetic verification rationales with Best-of-N sampling and test-time majority voting, which allows seamless integration with chain-of-thought reasoning and instruction tuning. Empirically, GenRM improves the number of problems solved with Best-of-N by 16-40% over discriminative, DPO, and LLM-as-a-Judge baselines on algorithmic and mathematical reasoning tasks; notably, it learns to detect subtle mathematical errors from synthetic rationales alone. The core contribution is recasting verification entirely as next-token prediction, establishing a unified generative foundation for reward modeling and reasoning validation.
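The key reframing described above can be sketched in a few lines: instead of a discriminative head producing a scalar score, the verifier's reward is the probability the LLM assigns to a "Yes" token when asked whether a candidate solution is correct. The `toy_model` below is a hypothetical stand-in for an LLM's next-token distribution, not part of the paper; a real implementation would query a pretrained model.

```python
# GenRM-style scoring sketch: reward = P("Yes" | verification prompt),
# obtained purely via next-token prediction. `model` is any callable
# mapping a prompt to a next-token probability distribution.

def genrm_score(problem: str, solution: str, model) -> float:
    """Score a candidate solution as the probability of a 'Yes' verdict token."""
    prompt = (
        f"Problem: {problem}\n"
        f"Solution: {solution}\n"
        f"Is the solution correct (Yes/No)? "
    )
    next_token_probs = model(prompt)  # dict: token -> probability
    return next_token_probs.get("Yes", 0.0)

def toy_model(prompt: str) -> dict:
    """Hypothetical stand-in model: prefers solutions containing '4'."""
    p_yes = 0.9 if "= 4" in prompt else 0.2
    return {"Yes": p_yes, "No": 1.0 - p_yes}

good = genrm_score("What is 2 + 2?", "2 + 2 = 4", toy_model)
bad = genrm_score("What is 2 + 2?", "2 + 2 = 5", toy_model)
```

Because the score is just a token probability, the same model can be trained jointly on verification prompts and ordinary solution-generation data with one next-token prediction objective.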
Abstract
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
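The Best-of-N plus majority-voting procedure from the abstract can be sketched as follows: each candidate is scored by the fraction of "Yes" verdicts over K independently sampled verification rationales, and the highest-scoring candidate is returned. The `verify_with_rationale` function is a hypothetical seeded stub standing in for sampling one chain-of-thought verification from the model.

```python
import random

def verify_with_rationale(problem: str, solution: str, seed: int) -> bool:
    """Stand-in for one sampled CoT verification ending in a Yes/No verdict.
    A real GenRM would generate a rationale, then read off the verdict token."""
    rng = random.Random(seed)
    p_yes = 0.85 if "= 4" in solution else 0.15  # toy verdict probability
    return rng.random() < p_yes                  # True means a "Yes" verdict

def best_of_n(problem: str, candidates: list[str], k: int = 32) -> str:
    """Best-of-N selection: score each candidate by majority vote over
    k sampled verification rationales, then return the top candidate."""
    def score(solution: str) -> float:
        votes = [verify_with_rationale(problem, solution, seed) for seed in range(k)]
        return sum(votes) / k                    # fraction of "Yes" verdicts

    return max(candidates, key=score)

picked = best_of_n("What is 2 + 2?", ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 3"])
```

Spending more inference compute here is as simple as raising k (more sampled rationales per candidate) or N (more candidate solutions), which is the scaling behavior the abstract highlights.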