Self-Rewarding PPO: Aligning Large Language Models with Demonstrations Only

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Supervised fine-tuning (SFT) is prone to overfitting and exhibits poor generalization under low-resource conditions. To address this, we propose Self-Rewarding PPO, a fully online alignment method that eliminates the need for human preference annotations. It constructs an implicit reward function based on the log-ratio between the SFT policy and the pretrained policy, thereby seamlessly integrating supervised signals into reinforcement learning. Crucially, alignment and generalization are jointly optimized in a single training phase, substantially mitigating performance degradation under data scarcity. Experiments across multiple NLP benchmarks demonstrate that our method consistently outperforms standard SFT, maintaining strong robustness and cross-domain generalization even with only a few demonstration examples. By decoupling alignment from costly human feedback and enabling data-efficient optimization, Self-Rewarding PPO establishes a new paradigm for scalable LLM alignment with minimal reliance on external supervision.

📝 Abstract
Supervised fine-tuning (SFT) has emerged as a crucial method for aligning large language models (LLMs) with human-annotated demonstrations. However, SFT, being an off-policy approach similar to behavior cloning, often struggles with overfitting and poor out-of-domain generalization, especially in limited-data scenarios. To address these limitations, we propose Self-Rewarding PPO, a novel fine-tuning method that leverages on-policy techniques to enhance generalization performance. Our approach combines the strengths of SFT and proximal policy optimization (PPO) to achieve more effective alignment from demonstration data. At its core is a reward function designed as the log policy ratio between the SFT model and the pretrained base model. This function serves as an implicit reward signal, using the pretrained policy as a baseline and the SFT policy as a target. By doing so, it enables on-policy fine-tuning without relying on human preference annotations. The integration of this self-rewarding mechanism with PPO addresses key limitations of SFT, improving generalization, data efficiency, and robustness. Our empirical evaluation across a range of natural language processing tasks demonstrates that Self-Rewarding PPO consistently outperforms traditional SFT methods. The results highlight the effectiveness of our approach in aligning LLMs using demonstration data, particularly in scenarios where high-quality annotated data is scarce.
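The core idea in the abstract, a reward defined as the log policy ratio between the SFT model and the pretrained base model, can be sketched with toy per-token distributions standing in for the two policies. The dictionaries and function names below are illustrative, not from the paper; in practice the probabilities would come from LLM next-token distributions.

```python
import math

def sequence_logprob(policy, tokens):
    """Sum of log-probabilities a (toy) policy assigns to a token sequence."""
    return sum(math.log(policy[t]) for t in tokens)

def implicit_reward(sft_policy, base_policy, tokens, beta=1.0):
    """Self-reward: beta * (log pi_SFT(y|x) - log pi_base(y|x))."""
    return beta * (sequence_logprob(sft_policy, tokens)
                   - sequence_logprob(base_policy, tokens))

# Hypothetical distributions: the SFT policy upweights demonstration-like tokens.
base = {"hello": 0.2, "world": 0.2, "noise": 0.6}
sft  = {"hello": 0.45, "world": 0.45, "noise": 0.1}

r_good = implicit_reward(sft, base, ["hello", "world"])  # positive reward
r_bad  = implicit_reward(sft, base, ["noise"])           # negative reward
print(r_good > 0, r_bad < 0)
```

Text the SFT policy prefers over the base policy receives positive reward, so maximizing this reward on-policy pulls the model toward the demonstration distribution without any preference labels.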
Problem

Research questions and friction points this paper is trying to address.

Improves generalization in language model alignment without human preferences
Addresses overfitting and poor out-of-domain generalization in limited data
Enables on-policy fine-tuning using demonstrations instead of preference annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines SFT and PPO for on-policy fine-tuning
Uses log policy ratio as implicit reward signal
Enables alignment without human preference annotations
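The bullets above can be illustrated with a minimal on-policy loop: sample from the current policy, score the sample with the implicit log-ratio reward, and update toward higher-reward tokens. This is a bandit-style REINFORCE stand-in for PPO, a sketch under assumed toy policies rather than the paper's implementation.

```python
import math
import random

random.seed(0)

base = {"a": 0.5, "b": 0.5}    # pretrained baseline policy (toy)
sft  = {"a": 0.8, "b": 0.2}    # SFT target policy (toy)
logits = {"a": 0.0, "b": 0.0}  # current trainable policy, softmax over logits

def probs(lg):
    """Softmax of the current logits."""
    z = sum(math.exp(v) for v in lg.values())
    return {k: math.exp(v) / z for k, v in lg.items()}

def reward(tok, beta=1.0):
    """Implicit self-reward: beta * (log pi_SFT - log pi_base)."""
    return beta * (math.log(sft[tok]) - math.log(base[tok]))

lr = 0.5
for _ in range(200):
    p = probs(logits)
    tok = random.choices(list(p), weights=list(p.values()))[0]  # on-policy sample
    adv = reward(tok)                # reward used directly as the advantage
    for k in logits:                 # REINFORCE gradient of log pi(tok)
        grad = (1.0 - p[k]) if k == tok else -p[k]
        logits[k] += lr * adv * grad

print(probs(logits))  # the policy shifts toward the SFT-preferred token "a"
```

A full PPO implementation would add a clipped surrogate objective and value baseline, but the key contribution is the same: the reward signal comes entirely from the two fixed policies, not from human preference annotations.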