Generative Verifiers: Reward Modeling as Next-Token Prediction

📅 2024-08-27
🏛️ arXiv.org
📈 Citations: 55
✨ Influential: 1
🤖 AI Summary
To address the limited reasoning capability of large language model (LLM)-based verifiers, this paper proposes Generative Reward Modeling (GenRM), which reframes reward modeling as an autoregressive next-token prediction task, thereby unifying verification and problem solving within the LLM's native generative paradigm. Methodologically, GenRM builds an end-to-end generative verifier from a pretrained LLM, combining synthetic verification-rationale supervision, Best-of-N sampling, and test-time majority voting, and integrates naturally with chain-of-thought reasoning and instruction tuning. Empirically, GenRM improves problem-solving accuracy by 16-40% over discriminative verifiers, DPO, and LLM-as-a-judge baselines on algorithmic and mathematical reasoning tasks; notably, it detects subtle mathematical errors using synthetic rationales alone. The core contribution is fully recasting verification as next-token prediction, establishing a unified generative foundation for reward modeling and reasoning validation.

๐Ÿ“ Abstract
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.
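The Best-of-N scheme the abstract describes can be sketched in a few lines. In GenRM, a candidate solution is scored by the probability the verifier LLM assigns to a "Yes" token after a prompt like "Is the answer correct?". The `yes_logprob` helper below is a hypothetical stand-in for that verifier call (not from the paper's code); its toy heuristic exists only so the sketch runs.

```python
import math

def yes_logprob(problem: str, solution: str) -> float:
    # Toy stand-in: a real implementation would run the verifier LLM and
    # read the log-probability of the "Yes" token from its next-token
    # distribution. Here we simply reward solutions containing "42".
    return 0.0 if "42" in solution else -2.0

def genrm_score(problem: str, solution: str) -> float:
    # GenRM-style score: r(x, y) = p("Yes" | x, y, "Is the answer correct?")
    return math.exp(yes_logprob(problem, solution))

def best_of_n(problem: str, candidates: list[str]) -> str:
    # Rank the N sampled solutions by verifier score and keep the best.
    return max(candidates, key=lambda s: genrm_score(problem, s))

problem = "What is 6 * 7?"
candidates = ["The answer is 41.", "The answer is 42.", "The answer is 44."]
print(best_of_n(problem, candidates))  # -> The answer is 42.
```

Because the score is just a next-token probability, the same pretrained generation machinery serves both solving and verifying, which is the unification the paper emphasizes.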
Problem

Research questions and friction points this paper is trying to address.

Discriminative LLM verifiers do not exploit the generative abilities of pretrained LLMs.
Can verifier training be reframed as next-token prediction?
Can extra test-time compute (e.g., majority voting) improve verification?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative verifiers trained with next-token prediction.
Joint training on verification and solution generation.
Majority voting over rationales improves verification.
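The majority-voting innovation above can be sketched as follows: sample K chain-of-thought verification rationales per candidate and average their Yes/No verdicts. `sample_verdict` is a hypothetical stand-in for generating one rationale with the verifier LLM; the noisy toy verdict inside it is an assumption so the sketch runs.

```python
import random

def sample_verdict(problem: str, solution: str, rng: random.Random) -> bool:
    # Toy stand-in: a real verifier would generate a rationale ending in
    # "Yes"/"No"; we simulate a noisy verdict that favors the answer "42".
    p_yes = 0.9 if "42" in solution else 0.2
    return rng.random() < p_yes

def majority_vote_score(problem: str, solution: str,
                        k: int = 32, seed: int = 0) -> float:
    # Averaging verdicts over K sampled rationales turns additional
    # inference-time compute into a smoother verification score.
    rng = random.Random(seed)
    votes = sum(sample_verdict(problem, solution, rng) for _ in range(k))
    return votes / k

print(majority_vote_score("What is 6 * 7?", "The answer is 42."))
```

Raising K trades inference-time compute for verification accuracy, which is how the paper reports favorable scaling with test-time compute.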