DuaShepherd: Integrating Stepwise Correctness and Potential Rewards for Mathematical Reasoning

📅 2025-06-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) for mathematical reasoning suffer from a disconnect between modeling step-wise correctness and modeling the probability of reaching the correct final answer. Method: This paper proposes DuaShepherd, a reward modeling framework that co-models two complementary signals, step-wise correctness and potential (the likelihood of reaching the correct final answer). It introduces an automated pipeline for constructing a large-scale reward dataset annotated with both signals, trains the two reward models jointly in a unified multi-head, multi-task architecture, and fuses their outputs into a compound probability. Contribution/Results: Evaluated on the MATH500 and ProcessBench benchmarks, DuaShepherd achieves state-of-the-art performance in mathematical reasoning under comparable resource constraints, significantly outperforming models trained on either reward signal alone.
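A minimal sketch of how such a dual-head reward model might look in PyTorch, assuming a HuggingFace-style backbone that exposes `last_hidden_state`; the head names, the pooling at step-end token positions, and the `alpha` mixing weight are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class DualHeadRewardModel(nn.Module):
    """Shared backbone with one scalar head per reward signal."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone  # shared transformer, e.g. a causal LM trunk
        self.correctness_head = nn.Linear(hidden_size, 1)
        self.potential_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask, step_end_positions):
        # Hidden states for every token; each reasoning step is scored at
        # the position of its final token (e.g., a step-separator token).
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                                  # (B, T, H)
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)  # (B, 1)
        step_hidden = hidden[batch_idx, step_end_positions]     # (B, S, H)
        p_correct = torch.sigmoid(self.correctness_head(step_hidden)).squeeze(-1)
        p_potential = torch.sigmoid(self.potential_head(step_hidden)).squeeze(-1)
        return p_correct, p_potential                        # each (B, S)


def multitask_loss(p_correct, p_potential, y_correct, y_potential, alpha=0.5):
    """Joint objective: BCE on step correctness plus BCE against the
    rollout-estimated success probability; alpha is an assumed weight."""
    bce = nn.functional.binary_cross_entropy
    return alpha * bce(p_correct, y_correct) + (1 - alpha) * bce(p_potential, y_potential)
```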

📝 Abstract
In this paper, we propose DuaShepherd, a novel reward modeling framework that integrates two complementary reward signals, correctness and potential, to enhance the mathematical reasoning capabilities of Large Language Models (LLMs). While correctness-based signals emphasize identification of stepwise errors, potential-based signals focus on the likelihood of reaching the correct final answer. We developed an automated pipeline for constructing a large-scale reward modeling dataset with both signals. A unified, multi-head architecture was explored to train the two reward models in a multi-task setup, demonstrating benefits from learning both correctness and potential in parallel. By combining these two signals into a compound probability, our model achieves consistent performance improvements across multiple benchmarks. Empirical evaluations on MATH500 and ProcessBench confirm that this combined reward significantly outperforms models trained on either reward type alone, achieving state-of-the-art performance under comparable resource constraints.
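To make the compound-probability idea concrete, here is a minimal sketch of the fusion step, assuming per-step probabilities from the two heads; the product fusion, the product aggregation over steps, and the best-of-N reranking use are assumptions, since the abstract does not give the exact combination formula.

```python
import math
from typing import Dict, List


def compound_score(p_correct: List[float], p_potential: List[float]) -> float:
    """Score a solution: at each step, multiply the probability that the
    step is correct by the probability that it still leads to the right
    final answer, then aggregate over steps (product here; min() is a
    common alternative for process reward models)."""
    return math.prod(pc * pp for pc, pp in zip(p_correct, p_potential))


def best_of_n(candidates: List[Dict]) -> Dict:
    """Rerank N sampled solutions by compound score and keep the best."""
    return max(
        candidates,
        key=lambda c: compound_score(c["p_correct"], c["p_potential"]),
    )
```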
Problem

Research questions and friction points this paper is trying to address.

How can correctness and potential reward signals jointly enhance LLMs' mathematical reasoning?
How can large-scale dataset creation for dual-reward modeling in math tasks be automated?
How should correctness and potential rewards be unified to improve model performance?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates correctness and potential reward signals
Automated pipeline for large-scale dataset construction (a rollout-based labeling sketch follows this list)
Multi-head architecture for multi-task reward training
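As a rough illustration of how the potential label can be produced without human annotation, the Monte Carlo estimate below (in the style of Math-Shepherd) samples completions from each step prefix and counts how often they reach the gold answer; `generate_completion` and `is_correct` are hypothetical helpers, and the paper's actual pipeline may differ in sampling details.

```python
def estimate_step_potential(question, prefix_steps, gold_answer,
                            generate_completion, is_correct, n_rollouts=8):
    """Potential of a partial solution: the empirical probability that a
    completion sampled from this step prefix reaches the gold answer."""
    hits = sum(
        is_correct(generate_completion(question, prefix_steps), gold_answer)
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts
```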
Authors
Yuanhao Wu (NewsBreak)
Juntong Song (NewsBreak)
Hanning Zhang (University of Illinois Urbana-Champaign)
Tong Zhang (University of Illinois Urbana-Champaign)
Cheng Niu (NewsBreak)