Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the prohibitively high computational cost of enhancing large language models’ (LLMs) mathematical reasoning capabilities, this paper proposes DPO-VP: an iterative online post-training framework grounded in Direct Preference Optimization (DPO). DPO-VP tightly couples a generator and a reward model to enable bidirectional co-optimization, and introduces a verifiable reward mechanism that dynamically filters preference data—achieving substantial performance gains for strong base models even after a single coarse-grained filtering pass. Experiments demonstrate that DPO-VP matches reinforcement learning (RL)-based methods in mathematical reasoning accuracy while drastically reducing training overhead. This work constitutes the first systematic validation of DPO as a computationally efficient, scalable alternative to RL for LLM reasoning enhancement. It establishes a new paradigm for low-cost, high-robustness LLM reasoning improvement, advancing both methodological rigor and practical deployability.

Technology Category

Application Category

📝 Abstract
Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.
Problem

Research questions and friction points this paper is trying to address.

Investigates DPO's effectiveness in enhancing LLM reasoning.
Proposes iterative DPO framework for mutual model improvement.
Achieves RL-level performance with lower computational costs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative DPO enhances LLM reasoning efficiently.
Coarse filtering boosts mathematical reasoning performance.
DPO-VP achieves RL-level performance with lower cost.
🔎 Similar Papers
No similar papers found.
Songjun Tu
Songjun Tu
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory
Large Language ModelsReinforecement Learning
J
Jiahao Lin
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
X
Xiangyu Tian
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Qichao Zhang
Qichao Zhang
中国科学院自动化研究所
人工智能 强化学习 博弈论 自适应动态规划
L
Linjing Li
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
Y
Yuqian Fu
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory; School of Artificial Intelligence, University of Chinese Academy of Sciences
N
Nan Xu
Wenge Technology
W
Wei He
Fudan University
Xiangyuan Lan
Xiangyuan Lan
Pengcheng Laboratory
Multimodal LLMPlace RecognitionVisual TrackingPerson Re-identificationObject Detection
Dongmei Jiang
Dongmei Jiang
Northwestern Polytechnical University; Peng Cheng Laboratory
Affective ComputingMultimodal emotion recognitionMultimodal mental state evaluation
Dongbin Zhao
Dongbin Zhao
Institute of Automation, Chinese Academy of Sciences
Deep Reinforcement LearningAdaptive Dynamic ProgrammingGame AISmart drivingrobotics