SGPO: Self-Generated Preference Optimization based on Self-Improver

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the reliance of large language models (LLMs) on human-annotated preference data, and the distribution shift that off-policy training on such data introduces, this paper proposes Self-Generated Preference Optimization (SGPO), a framework that unifies the policy model and a self-improvement module in a single model. Starting from supervised fine-tuning (SFT), SGPO iteratively generates responses with the current policy, has the self-improver refine them into higher-quality alternatives, and uses the resulting (refined, original) pairs as self-generated preference data for direct preference optimization (DPO) of the policy. Because the preference data come from the policy itself, training stays on-policy and requires no external human annotation. Empirically, SGPO significantly outperforms DPO and existing self-improving baselines on AlpacaEval 2.0 and Arena-Hard. Its core contributions are a closed-loop "policy-as-improver" architecture and an incremental, self-generated preference mechanism.
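The closed-loop cycle described above can be sketched as follows. All function and variable names here are illustrative assumptions, not the paper's actual implementation; the real improver and policy are a single fine-tuned LLM, stubbed out below with string operations.

```python
# Hypothetical sketch of one SGPO iteration: the same model acts as
# policy (generator) and improver (refiner). The refined response is
# treated as "chosen" and the original draft as "rejected", yielding
# a self-generated preference pair for DPO. Names are stand-ins.

def generate(model, prompt):
    # Stand-in for sampling a response from the current policy.
    return f"draft answer to: {prompt}"

def improve(model, prompt, response, sft_reference):
    # Stand-in for the improver making an incremental refinement,
    # guided by a supervised fine-tuning (SFT) reference output.
    return response + " [refined toward SFT quality]"

def sgpo_iteration(model, prompts, sft_outputs):
    """Build self-generated preference pairs without human labels."""
    pairs = []
    for prompt, sft_ref in zip(prompts, sft_outputs):
        rejected = generate(model, prompt)                  # policy draft
        chosen = improve(model, prompt, rejected, sft_ref)  # refined draft
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs  # fed to DPO to update the unified policy/improver model

pairs = sgpo_iteration(model=None,
                       prompts=["What is DPO?"],
                       sft_outputs=["SFT reference answer"])
```

Each DPO update shifts the policy toward its own refined outputs, so the next iteration's drafts start from a stronger base, which is the "progressive" aspect of the mechanism.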

📝 Abstract
Large language models (LLMs), despite their extensive pretraining on diverse datasets, require effective alignment to human preferences for practical and reliable deployment. Conventional alignment methods typically employ off-policy learning and depend on human-annotated datasets, which limits their broad applicability and introduces distribution shift issues during training. To address these challenges, we propose Self-Generated Preference Optimization based on Self-Improver (SGPO), an innovative alignment framework that leverages an on-policy self-improving mechanism. Specifically, the improver refines responses from a policy model to self-generate preference data for direct preference optimization (DPO) of the policy model. Here, the improver and policy are unified into a single model, and in order to generate higher-quality preference data, this self-improver learns to make incremental yet discernible improvements to the current responses by referencing supervised fine-tuning outputs. Experimental results on AlpacaEval 2.0 and Arena-Hard show that the proposed SGPO significantly improves performance over DPO and baseline self-improving methods without using external preference data.
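For context, the DPO objective referenced in the abstract scores a (chosen, rejected) pair by the policy's log-probability margin over a frozen reference model. A minimal numeric sketch in pure Python (the inputs are illustrative log-probabilities, not values from the paper):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))).
    beta controls how strongly the policy may deviate from the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# A positive margin (policy favors the chosen response more than the
# reference does) pushes the loss below log(2) ~= 0.693; a zero margin
# sits exactly at log(2).
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                ref_logp_chosen=-11.0, ref_logp_rejected=-11.0)
```

In SGPO both responses in each pair are self-generated, so this loss is applied to (improver-refined, original) pairs rather than human-annotated ones.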
Problem

Research questions and friction points this paper is trying to address.

Aligning LLMs with human preferences effectively
Reducing reliance on human-annotated datasets
Improving response quality via self-generated preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

On-policy self-improving alignment framework
Self-generated preference data for optimization
Unified model for improver and policy
Hyeonji Lee, Korea University, Republic of Korea
Daejin Jo, Korea University, Republic of Korea
Seohwan Yun, Korea University, Republic of Korea
Sungwoong Kim, Associate Professor, Korea University (artificial general intelligence)