Semi-Supervised Preference Optimization with Limited Feedback

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Preference alignment faces a semi-supervised learning challenge: pairwise-labeled data is scarce while unlabeled samples are abundant. Method: This paper proposes Semi-Supervised Preference Optimization (SSPO), which establishes, under the assumption of separable win/loss responses, the existence and uniqueness of an optimal reward threshold. Leveraging this result, SSPO introduces a principled pseudo-labeling mechanism that uses minimal pairwise annotations to generate high-quality pseudo-preference labels for large-scale unlabeled data. Contribution/Results: SSPO provides the first theoretical foundation for low-resource preference alignment and a provably optimal pseudo-labeling criterion. Empirically, Llama3-8B-Instruct trained with SSPO on only 1% of UltraFeedback consistently outperforms strong baselines trained on 10% of the same data, substantially reducing annotation cost.
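To make the pseudo-labeling mechanism concrete, here is a minimal Python sketch, assuming access to a scalar reward model and a threshold of the kind the paper analyzes; the names `pseudo_label`, `reward`, and `tau` are illustrative and do not come from the paper.

```python
# Hypothetical sketch of threshold-based pseudo-labeling (not the paper's code).
# `reward` stands in for any scalar reward model; `tau` is the reward threshold.

def pseudo_label(prompts, responses, reward, tau):
    """Split unpaired responses into pseudo-wins and pseudo-losses by reward."""
    pseudo_pairs = []
    for prompt, candidates in zip(prompts, responses):
        scored = [(reward(prompt, c), c) for c in candidates]
        wins = [c for s, c in scored if s >= tau]   # above threshold: pseudo-win
        losses = [c for s, c in scored if s < tau]  # below threshold: pseudo-loss
        # Pair each pseudo-win with a pseudo-loss to form preference pairs.
        pseudo_pairs += [(prompt, w, l) for w in wins for l in losses]
    return pseudo_pairs
```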

📝 Abstract
The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on large amounts of paired (labeled) feedback data, incurring substantial resource expenditure. To address this challenge, we study the problem of Semi-Supervised Preference Optimization (SSPO), which learns from a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Llama3-8B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
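One plausible way such pseudo-pairs could enter training is through a DPO-style objective over both labeled and pseudo-labeled pairs. The sketch below is an assumption for illustration, not the paper's released code; in particular, the mixing weight `lam` is hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between the (pseudo-)winning and losing responses.
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

def semi_supervised_loss(labeled, pseudo, lam=1.0):
    # labeled / pseudo: tuples of per-example log-prob tensors
    # (policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l).
    return dpo_loss(*labeled) + lam * dpo_loss(*pseudo)

# Toy usage with random log-probabilities standing in for model outputs.
labeled = tuple(torch.randn(8) for _ in range(4))
pseudo = tuple(torch.randn(64) for _ in range(4))
loss = semi_supervised_loss(labeled, pseudo, lam=0.5)
```

Down-weighting the pseudo-labeled term lets the noisier pseudo-preferences contribute signal without dominating the few trusted labels.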
Problem

Research questions and friction points this paper is trying to address.

Optimizing language model alignment with limited human feedback
Reducing reliance on costly labeled preference data
Leveraging unlabeled data through a principled pseudo-labeling approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-supervised learning from limited paired feedback
Optimal reward threshold for pseudo-labeling unpaired data (a toy estimation sketch follows this list)
Distilling latent preferences from large unpaired datasets
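As referenced in the list above, here is a naive empirical stand-in for the threshold: given reward scores for a handful of labeled wins and losses, sweep candidate cut points and keep the one with the best balanced separation. The paper derives its threshold theoretically; `fit_threshold` below is only an illustrative approximation under the same separability assumption.

```python
import numpy as np

def fit_threshold(win_rewards, loss_rewards):
    """Pick a scalar threshold separating labeled win/loss reward scores.

    Sweeps the observed scores as candidate cut points and maximizes
    balanced accuracy (wins above, losses below the threshold).
    """
    candidates = np.unique(np.concatenate([win_rewards, loss_rewards]))
    best_tau, best_acc = None, -1.0
    for tau in candidates:
        acc = 0.5 * ((win_rewards >= tau).mean() + (loss_rewards < tau).mean())
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau

# Toy usage on a handful of labeled reward scores.
tau = fit_threshold(np.array([1.2, 0.9, 1.5]), np.array([0.1, -0.3, 0.4]))
```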
Authors
Seonggyun Lee (Yonsei University)
Sungjun Lim (Yonsei University) · Bayesian Neural Network, Optimization, Model Merging
Seojin Park (Yonsei University)
Soeun Cheon (Korea Advanced Institute of Science and Technology)
Kyungwoo Song (Yonsei University) · Machine Learning, Deep Learning, Neural Networks