Generalist Reward Models: Found Inside Large Language Models

πŸ“… 2025-06-29
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Large language models (LLMs) rely on costly human preference data to train reward models for alignment, hindering scalability and efficiency. Method: We propose an endogenous reward extraction method that requires no additional training, fine-tuning, human annotation, or AI feedback. Grounded in theoretical analysis, we show that the implicit reward function encoded in pretrained autoregressive LMs is equivalent to the solution of offline inverse reinforcement learning (IRL) and admits a tighter policy error bound than the base model. Our approach directly decodes this intrinsic reward signal via a principled, analytical derivation from standard LM outputs. Results: Empirical evaluation across multiple alignment benchmarks demonstrates that our method outperforms mainstream LLM-as-a-judge approaches and matches or exceeds the performance of explicitly trained reward models, while eliminating dependence on preference data. This yields substantial gains in alignment efficiency, scalability, and theoretical rigor.

πŸ“ Abstract
The alignment of Large Language Models (LLMs) is critically dependent on reward models trained on costly human preference data. While recent work explores bypassing this cost with AI feedback, these methods often lack a rigorous theoretical foundation. In this paper, we discover that a powerful generalist reward model is already latently present within any LLM trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is theoretically equivalent to a reward function learned through offline inverse reinforcement learning. This connection allows us to directly elicit a high-quality reward signal from a base (pre-trained or supervised fine-tuned) model without any further training. Critically, we also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model. To the best of our knowledge, this is the first theoretical proof of the effectiveness of reinforcement learning for LLMs. Our experiments validate this theory, demonstrating that our method not only outperforms existing LLM-as-a-judge approaches but can also surpass explicitly trained reward models. These findings suggest that the reward modeling stage can be replaced by a principled method of eliciting the knowledge already captured during pre-training, heralding a more efficient, powerful, and scalable paradigm for the alignment of LLMs as well as multi-modal models.
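The abstract's claim that a reward signal can be elicited analytically from standard LM outputs can be sketched as follows. This is a minimal illustration, assuming the common inverse soft Q-learning view in which a model's logits are treated as soft Q-values and the per-token reward is recovered via the inverse soft Bellman operator, r(s_t, a_t) = Q(s_t, a_t) − V(s_{t+1}) with V(s) = logsumexp over the logits. The function names and this exact formulation are illustrative assumptions, not the paper's code.

```python
# Sketch: eliciting an "endogenous reward" from LM logits without training.
# Assumption: logits play the role of soft Q-values (inverse soft Q-learning
# view); the paper's exact derivation may differ in details.
import math

def logsumexp(xs):
    # Numerically stable log-sum-exp, used as the soft state value V(s).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def endogenous_reward(logits, token_ids):
    """logits: per-step logit vectors over the vocab, length T + 1;
    token_ids: the T tokens actually generated.
    Returns the summed per-token reward r_t = Q(s_t, a_t) - V(s_{t+1})."""
    total = 0.0
    for t, a in enumerate(token_ids):
        q = logits[t][a]                   # Q(s_t, a_t): logit of chosen token
        v_next = logsumexp(logits[t + 1])  # V(s_{t+1}): soft value of next state
        total += q - v_next
    return total

# Toy example with a vocabulary of 3 and a 2-token response.
toy_logits = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0], [0.0, 0.0, 0.0]]
score = endogenous_reward(toy_logits, [0, 1])
```

A response that the model itself assigns higher chosen-token logits receives a higher score, so the score can rank candidate responses as a reward model would, using only a forward pass and no preference data.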
Problem

Research questions and friction points this paper is trying to address.

Discover latent generalist reward models in LLMs
Prove equivalence to inverse reinforcement learning rewards
Enable superior RL-based alignment without explicit training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies a latent reward model inside LLMs
Links this reward to offline inverse reinforcement learning
Elicits the reward signal without additional training
πŸ”Ž Similar Papers
No similar papers found.
Yi-Chen Li
Nanjing University
Reinforcement Learning, Imitation Learning, RLHF
Tian Xu
National Key Laboratory for Novel Software Technology, Nanjing University, China
Yang Yu
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Xuqin Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China
Xiong-Hui Chen
National Key Laboratory for Novel Software Technology, Nanjing University, China
Zhongxiang Ling
National Key Laboratory for Novel Software Technology, Nanjing University, China
Ningjing Chao
National Key Laboratory for Novel Software Technology, Nanjing University, China
Lei Yuan
National Key Laboratory for Novel Software Technology, Nanjing University, China
Zhi-Hua Zhou
Nanjing University
Artificial Intelligence, Machine Learning, Data Mining