CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

📅 2024-10-03
🏛️ arXiv.org
📈 Citations: 5 (influential: 0)
🤖 AI Summary
High-quality preference data for Reinforcement Learning from Human Feedback (RLHF) is scarce and expensive to obtain via manual annotation. Method: This paper proposes CodePMP, a scalable preference model pretraining pipeline that leverages large-scale public source code to automatically construct code-preference pairs, enabling preference models to be pretrained without human annotation and substantially reducing reliance on manual labeling. The method comprises pretraining on synthesized code-preference pairs, followed by reward-model finetuning on task-specific preference data. Contribution/Results: The authors evaluate CodePMP on mathematical reasoning benchmarks (GSM8K, MATH) and logical reasoning benchmarks (ReClor, LogiQA2.0), demonstrating consistent improvements in LLM reasoning performance. The results indicate that preference modeling grounded in code priors enables efficient, scalable reward learning, offering a practical pathway toward more robust and cost-effective RLHF pipelines.
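The pair-construction step lends itself to a compact illustration. Below is a minimal Python sketch of how a code-preference pair might be synthesized under the scheme the summary describes: two generators of differing quality answer the same code prompt, and the stronger output is taken as "chosen". The `PreferencePair` type and the generator callables are hypothetical placeholders, not the paper's actual interface.

```python
# Minimal sketch of synthetic code-preference pair construction (assumed
# scheme: stronger generator -> "chosen", weaker generator -> "rejected").
from dataclasses import dataclass
from typing import Callable

@dataclass
class PreferencePair:
    prompt: str    # code description / docstring drawn from public source code
    chosen: str    # response assumed to be higher quality
    rejected: str  # response assumed to be lower quality

def synthesize_pair(prompt: str,
                    strong_generate: Callable[[str], str],
                    weak_generate: Callable[[str], str]) -> PreferencePair:
    """Build one preference pair from a single code prompt.

    strong_generate / weak_generate stand in for two code LLMs of
    different capability; no human annotation is involved.
    """
    return PreferencePair(
        prompt=prompt,
        chosen=strong_generate(prompt),
        rejected=weak_generate(prompt),
    )
```

Because both responses come from models rather than annotators, this construction scales with the size of the available source-code corpus, which is the point of the pipeline.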

📝 Abstract
Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.
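For context, reward models of this kind are commonly trained with a pairwise (Bradley-Terry) ranking objective, both at pretraining and finetuning time. The sketch below shows that standard loss in PyTorch; it is an illustrative assumption about the objective, not a verbatim reproduction of CodePMP's implementation.

```python
# Standard pairwise ranking loss for reward-model training (Bradley-Terry).
# CodePMP's exact objective may differ in detail.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(r_chosen: torch.Tensor,
                     r_rejected: torch.Tensor) -> torch.Tensor:
    """r_chosen / r_rejected: scalar rewards the RM assigns to each response.

    Minimizing -log sigmoid(r_chosen - r_rejected) pushes the model to
    score preferred responses above rejected ones.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Example: rewards for a batch of three preference pairs.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3, 2.0]),
                        torch.tensor([0.4, 0.9, 1.1]))
```

Pretraining this objective on millions of synthesized code pairs is what lets the subsequent finetuning stage get by with far less annotated preference data.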
Problem

Research questions and friction points this paper is trying to address.

Addresses scarcity of high-quality preference data for LLM reasoning
Enhances reward model finetuning via scalable pretraining on code-preference pairs
Improves LLM reasoning performance on mathematical and logical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pretrains preference models using synthesized code-preference pairs
Enhances reward model finetuning efficiency
Improves LLM reasoning on math and logic tasks
Huimu Yu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Xing Wu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Xiaohongshu Inc
Weidong Yin
Independent Researcher
Debing Zhang
Xiaohongshu
Machine Learning · Computer Vision · Deep Learning
Songlin Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences