Full-Step-DPO: Self-Supervised Preference Optimization with Step-wise Rewards for Mathematical Reasoning

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing direct preference optimization (DPO) methods for long-chain mathematical reasoning, e.g., Step-DPO, are limited: they focus only on the first erroneous step, rely on manual or GPT-4-generated annotations to localize errors, and lack fine-grained process-level supervision. Method: Full-Step-DPO is the first DPO framework to incorporate self-supervised process reward modeling, which automatically generates learnable, step-wise rewards for every reasoning step. It introduces a step-weighted DPO loss that enables end-to-end optimization of the quality of the entire reasoning trajectory. Contribution/Results: The method requires no external annotations. On in-domain (MathQA, AMC23) and out-of-domain (AIME) mathematical benchmarks, it consistently surpasses state-of-the-art methods and significantly improves both the reasoning accuracy and the robustness of base models, including LLaMA-3 and Qwen2, demonstrating strong generalization across diverse problem types and difficulty levels.
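The self-supervised process reward model scores each step without human or GPT-4 labels. One common way to self-supervise such step rewards (the paper's exact procedure may differ) is Monte Carlo estimation: from each step prefix, sample several completions and use the fraction that reach a correct final answer as that step's reward. A minimal sketch, where `rollout_fn` and `is_correct_fn` are hypothetical callbacks standing in for the generator and answer checker:

```python
def estimate_step_rewards(prefix_steps, rollout_fn, is_correct_fn, n_rollouts=8):
    """Monte Carlo step-reward estimation (a common self-supervision scheme;
    an assumption here, not necessarily the paper's exact procedure).

    For each prefix of the reasoning chain, sample `n_rollouts` completions
    and score the last step of the prefix by the fraction of completions
    that end in a correct final answer.
    """
    rewards = []
    for k in range(1, len(prefix_steps) + 1):
        prefix = prefix_steps[:k]
        # Each rollout continues the chain from this prefix to a final answer.
        hits = sum(is_correct_fn(rollout_fn(prefix)) for _ in range(n_rollouts))
        rewards.append(hits / n_rollouts)
    return rewards
```

With a deterministic stand-in generator this yields one reward per step, high for prefixes from which completions tend to succeed and low otherwise.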

📝 Abstract
Direct Preference Optimization (DPO) often struggles with long-chain mathematical reasoning. Existing approaches, such as Step-DPO, typically improve this by focusing on the first erroneous step in the reasoning chain. However, they overlook all other steps and rely heavily on humans or GPT-4 to identify erroneous steps. To address these issues, we propose Full-Step-DPO, a novel DPO framework tailored for mathematical reasoning. Instead of optimizing only the first erroneous step, it leverages step-wise rewards from the entire reasoning chain. This is achieved by training a self-supervised process reward model, which automatically scores each step, providing rewards while avoiding reliance on external signals. Furthermore, we introduce a novel step-wise DPO loss, which dynamically updates gradients based on these step-wise rewards, endowing language models with stronger reasoning capabilities. Extensive evaluations on both in-domain and out-of-domain mathematical reasoning benchmarks across various base language models demonstrate that Full-Step-DPO achieves superior performance compared to state-of-the-art baselines.
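The step-wise DPO loss described above weights the preference gradient by per-step rewards rather than treating the chain as a single unit. A minimal sketch of one plausible formulation, assuming the step rewards are normalized into weights over a sequence's per-step policy/reference log-ratios (the paper's exact weighting scheme is not given here, so this is illustrative only):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step_weighted_dpo_loss(chosen_logratios, rejected_logratios,
                           chosen_rewards, rejected_rewards, beta=0.1):
    """Hypothetical step-weighted DPO loss (an illustrative assumption,
    not the paper's exact formula).

    Each list holds one value per reasoning step:
    - *_logratios: log pi_theta(step) - log pi_ref(step)
    - *_rewards:   step-wise rewards from a process reward model
    """
    def weights(rewards):
        # Normalize rewards into step weights; fall back to uniform
        # weights when all rewards are zero (assumption).
        total = sum(rewards)
        n = len(rewards)
        return [r / total for r in rewards] if total else [1.0 / n] * n

    w_c = weights(chosen_rewards)
    w_r = weights(rejected_rewards)
    # Reward-weighted aggregate of per-step log-ratios, then the
    # standard DPO logistic loss on the resulting margin.
    margin = (sum(w * lr for w, lr in zip(w_c, chosen_logratios))
              - sum(w * lr for w, lr in zip(w_r, rejected_logratios)))
    return -math.log(sigmoid(beta * margin))
```

As in standard DPO, a larger weighted margin between chosen and rejected trajectories drives the loss toward zero; the step weights let high-reward steps dominate the gradient instead of only the first erroneous one.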
Problem

Research questions and friction points this paper is trying to address.

DPO struggles to optimize long-chain mathematical reasoning
Existing step-level methods rely on humans or GPT-4 to locate erroneous steps
Only the first erroneous step is optimized; the rest of the chain is ignored
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised process reward model providing step-wise rewards
Step-wise DPO loss with dynamically weighted gradients
Automatic scoring of every reasoning step, with no external annotations