🤖 AI Summary
High-quality preference data for Reinforcement Learning from Human Feedback (RLHF) is scarce and expensive to obtain via manual annotation.
Method: This paper proposes CodePMP, a scalable preference model pretraining (PMP) pipeline that leverages large-scale public source code to automatically construct code-preference pairs, enabling pretraining of reward models without manual labels and substantially reducing reliance on human annotation. The method comprises two stages: preference-model pretraining on synthetic code-preference data, followed by reward-model finetuning on downstream preference data.
Contribution/Results: The authors systematically evaluate CodePMP on mathematical reasoning benchmarks (GSM8K, MATH) and logical reasoning benchmarks (ReClor, LogiQA2.0), demonstrating consistent improvements in LLM reasoning performance. The results indicate that preference modeling grounded in code priors enables efficient, scalable reward learning, offering a practical path toward more robust and cost-effective RLHF pipelines.
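To make the training objective concrete: reward models of this kind are typically trained with a pairwise (Bradley-Terry) ranking loss, which pushes the model to score the preferred ("chosen") response above the dispreferred ("rejected") one. The sketch below is illustrative, not the paper's implementation; the function name and toy scores are assumptions.

```python
import math

def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the chosen sample
    above the rejected one, and large when the ordering is reversed.
    """
    margin = chosen_score - rejected_score
    # Numerically stable -log(sigmoid(margin)):
    #   log(1 + e^{-m})            for m >= 0
    #   -m + log(1 + e^{m})        for m <  0
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A correctly ordered pair yields a small loss; a reversed pair a large one.
good = pairwise_preference_loss(2.0, -1.0)   # chosen scored higher
bad = pairwise_preference_loss(-1.0, 2.0)    # chosen scored lower
assert good < bad
```

In PMP-style pretraining, this loss would be applied to large numbers of synthesized code-preference pairs before any human-annotated preference data is seen, so the later RM finetuning stage starts from a model that already ranks responses sensibly.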
📝 Abstract
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that improves RM finetuning efficiency by pretraining preference models on a large corpus of code-preference pairs synthesized from publicly available high-quality source code. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in the reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.