Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation

📅 2025-03-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the prohibitively high computational cost of enhancing large language models’ (LLMs) mathematical reasoning capabilities, this paper proposes DPO-VP: an iterative online post-training framework grounded in Direct Preference Optimization (DPO). DPO-VP tightly couples a generator and a reward model to enable bidirectional co-optimization, and introduces a verifiable reward mechanism that dynamically filters preference data—achieving substantial performance gains for strong base models even after a single coarse-grained filtering pass. Experiments demonstrate that DPO-VP matches reinforcement learning (RL)-based methods in mathematical reasoning accuracy while drastically reducing training overhead. This work constitutes the first systematic validation of DPO as a computationally efficient, scalable alternative to RL for LLM reasoning enhancement. It establishes a new paradigm for low-cost, high-robustness LLM reasoning improvement, advancing both methodological rigor and practical deployability.

Technology Category

Application Category

📝 Abstract

Recent advancements in post-training methodologies for large language models (LLMs) have highlighted reinforcement learning (RL) as a critical component for enhancing reasoning. However, the substantial computational costs associated with RL-based approaches have led to growing interest in alternative paradigms, such as Direct Preference Optimization (DPO). In this study, we investigate the effectiveness of DPO in facilitating self-improvement for LLMs through iterative preference-based learning. We demonstrate that a single round of DPO with coarse filtering significantly enhances mathematical reasoning performance, particularly for strong base model. Furthermore, we design an iterative enhancement framework for both the generator and the reward model (RM), enabling their mutual improvement through online interaction across multiple rounds of DPO. Finally, with simple verifiable rewards, our model DPO-VP achieves RL-level performance with significantly lower computational overhead. These findings highlight DPO as a scalable and cost-effective alternative to RL, offering a practical solution for enhancing LLM reasoning in resource-constrained situations.

Problem

Research questions and friction points this paper is trying to address.

Investigates DPO's effectiveness in enhancing LLM reasoning.

Proposes iterative DPO framework for mutual model improvement.

Achieves RL-level performance with lower computational costs.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Iterative DPO enhances LLM reasoning efficiently.

Coarse filtering boosts mathematical reasoning performance.

DPO-VP achieves RL-level performance with lower cost.

🔎 Similar Papers

No similar papers found.

Authors to Follow