🤖 AI Summary
High-quality preference data for Reinforcement Learning from Human Feedback (RLHF) is scarce and expensive to obtain via manual annotation.
Method: This paper proposes CodePMP, a scalable preference model pretraining (PMP) pipeline that leverages large-scale public source code to automatically construct code-preference pairs, enabling pretraining of reward models without manual labels and substantially reducing reliance on human annotation. The method comprises two stages: preference-model pretraining on synthetic code-preference data, followed by reward-model finetuning on downstream preference data.
Contribution/Results: The authors systematically evaluate CodePMP on mathematical reasoning benchmarks (GSM8K, MATH) and logical reasoning benchmarks (ReClor, LogiQA2.0), demonstrating consistent improvements in LLM reasoning performance. The results indicate that preference modeling grounded in code priors enables efficient, scalable reward learning, offering a practical path toward more robust and cost-effective RLHF pipelines.
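To make the training objective concrete: reward models of this kind are typically trained with a pairwise (Bradley-Terry) ranking loss, which pushes the model to score the preferred ("chosen") response above the dispreferred ("rejected") one. The sketch below is illustrative, not the paper's implementation; the function name and toy scores are assumptions.

```python
import math

def pairwise_preference_loss(chosen_score: float, rejected_score: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the chosen sample
    above the rejected one, and large when the ordering is reversed.
    """
    margin = chosen_score - rejected_score
    # Numerically stable -log(sigmoid(margin)):
    #   log(1 + e^{-m})            for m >= 0
    #   -m + log(1 + e^{m})        for m <  0
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A correctly ordered pair yields a small loss; a reversed pair a large one.
good = pairwise_preference_loss(2.0, -1.0)   # chosen scored higher
bad = pairwise_preference_loss(-1.0, 2.0)    # chosen scored lower
assert good < bad
```

In PMP-style pretraining, this loss would be applied to large numbers of synthesized code-preference pairs before any human-annotated preference data is seen, so the later RM finetuning stage starts from a model that already ranks responses sensibly.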
📝 Abstract
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that improves RM finetuning efficiency by pretraining preference models on a large corpus of code-preference pairs synthesized from publicly available high-quality source code. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in the reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.