Self-Rewarding Language Models

📅 2024-01-18

🏛️ International Conference on Machine Learning

📈 Citations: 264

✨ Influential: 22

🤖 AI Summary

Problem: Reward models for language models are usually trained on human feedback, so they are limited by annotation quality and stay fixed during training, which caps performance gains. Method: We propose a self-rewarding framework in which the language model itself, prompted as an LLM-as-a-Judge, generates its own reward signal during training instead of relying on a separate frozen reward model. Specifically, we fine-tune Llama 2 70B with three iterations of Direct Preference Optimization (DPO), using preference data derived from the model's own judgments. Results: The resulting model outperforms Claude 2, Gemini Pro, and GPT-4 0613 on the AlpacaEval 2.0 leaderboard, and its ability to provide high-quality rewards improves alongside its instruction following across iterations. This points toward models that can keep improving on both axes, potentially beyond the level set by human feedback.

📝 Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
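
To make the self-rewarding step more concrete, here is a minimal Python sketch of how one iteration might turn the model's own judgments into preference pairs for DPO. Everything named here is hypothetical: `generate` and `judge` stand in for calls to the same language model (once as a responder, once as an LLM-as-a-Judge), and the pairing rule (highest-scored versus lowest-scored candidate, skipping ties) is an assumption of this sketch, not something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the judge scored highest
    rejected: str  # response the judge scored lowest


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], List[str]],  # hypothetical: samples candidate responses
    judge: Callable[[str, str], float],    # hypothetical: model scores its own response
) -> List[PreferencePair]:
    """Build DPO training pairs from self-assigned scores (illustrative only)."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        scored = [(judge(prompt, response), response) for response in generate(prompt)]
        if len(scored) < 2:
            continue  # need at least two candidates to form a pair
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] == worst[0]:
            continue  # assumed rule: identical scores carry no preference signal
        pairs.append(PreferencePair(prompt, chosen=best[1], rejected=worst[1]))
    return pairs


# Toy stand-ins so the sketch runs without a real model.
def toy_generate(prompt: str) -> List[str]:
    return [f"{prompt} -> brief", f"{prompt} -> a longer, more detailed reply"]


def toy_judge(prompt: str, response: str) -> float:
    return float(len(response))  # pretend longer answers are better


if __name__ == "__main__":
    for pair in build_preference_pairs(["Explain DPO"], toy_generate, toy_judge):
        print(pair)
```

In an iterative setup, the pairs produced by the current model would be used to train the next model, which then plays both the responder and the judge in the following round.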

Problem

Research questions and friction points this paper is trying to address.

Overcoming the human-performance bottleneck in reward models

Integrating self-rewarding mechanisms during LLM training

Enhancing both instruction following and reward generation capabilities

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-rewarding via LLM-as-a-Judge prompting

Iterative DPO training for self-improvement

Fine-tuning Llama 2 70B for three iterations to outperform Claude 2, Gemini Pro, and GPT-4 0613 on AlpacaEval 2.0
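
For reference, the standard DPO objective that an iterative scheme like this would minimize at each round can be written as below. The notation is an assumption of this note rather than taken from the page: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ a frozen reference policy (for example, the model the round starts from), $\beta$ a temperature-like hyperparameter, $\sigma$ the logistic function, and $\mathcal{D}_t$ the set of self-judged pairs $(x, y_w, y_l)$ collected at iteration $t$, with $y_w$ the preferred and $y_l$ the dispreferred response.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_t}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the response the judge preferred relative to the one it rejected, measured against the reference policy, so no separate reward model has to be trained.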
