Better Process Supervision with Bi-directional Rewarding Signals

📅 2025-03-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing process supervision methods (e.g., PRM) provide only unidirectional, local step-level rewards, ignoring both accumulated execution cost and remaining distance to the goal, which limits their effectiveness for complex, long-horizon reasoning. To address this, we propose Bidirectional Reward Modeling (BiRM), the first approach to incorporate A*-style heuristic principles into LLM process supervision. BiRM jointly models forward execution cost and backward goal distance, enabling globally consistent, goal-aware reasoning guidance. Our method comprises: (1) a dual-directional reward signal design, (2) a process-level supervision training framework, and (3) an inference strategy integrating Best-of-N sampling with heuristic search. On Gaokao2023, BiRM achieves a 3.1% absolute accuracy gain; on MATH-500, it outperforms ORM and PRM by 5.0% and 3.8%, respectively, in solution success rate. Empirical results demonstrate substantially improved stability and success in long-range reasoning tasks.

๐Ÿ“ Abstract
Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.
Problem

Research questions and friction points this paper is trying to address.

How to improve process supervision for complex LLM reasoning
Existing PRMs provide one-directional rewards with no estimate of the distance remaining to the final answer
How to evaluate reasoning steps more accurately in mathematical tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bi-directional rewarding signals for process supervision
A* algorithm-inspired cost and target estimation
BiRM model evaluates past and future success
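The A*-inspired idea above can be sketched in code: combine a backward-looking value (correctness of the steps taken so far, analogous to A*'s incurred cost g) with a forward-looking estimate (probability of eventually reaching a correct answer, analogous to the heuristic h), then rerank Best-of-N samples by the combined score. This is a minimal illustrative sketch, assuming a simple averaged step value and a weighted sum; the function names, candidate structure, and the `alpha` weighting are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of A*-style bidirectional scoring for Best-of-N reranking.
# All names and the weighted-sum combination rule are illustrative assumptions.

def birm_style_score(step_value_scores, future_success_prob, alpha=1.0):
    """Combine past-step correctness (g, backward) with an estimated
    probability of future success (h, forward), A*-style: f = g + alpha * h."""
    g = sum(step_value_scores) / len(step_value_scores)  # avg correctness so far
    h = future_success_prob                              # estimated goal proximity
    return g + alpha * h

def best_of_n(candidates):
    """Rerank sampled solutions by the combined bidirectional score."""
    return max(candidates,
               key=lambda c: birm_style_score(c["steps"], c["future_p"]))

# Two toy candidates: "a" has better steps so far, "b" is likelier to finish
# correctly; the bidirectional score prefers "b".
candidates = [
    {"id": "a", "steps": [0.9, 0.8], "future_p": 0.2},
    {"id": "b", "steps": [0.7, 0.7], "future_p": 0.9},
]
print(best_of_n(candidates)["id"])  # -> b
```

A PRM-style score would use only the step values (g) and pick candidate "a"; adding the forward estimate is what flips the ranking here, which is the intuition behind the bidirectional signal.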
Authors
Wenxiang Chen, Fudan University (LLM reasoning, LLM-based agents)
Wei He, School of Computer Science, Fudan University
Zhiheng Xi, Fudan University (LLM reasoning, LLM-based agents)
Honglin Guo, Fudan University (large language models)
Boyang Hong, School of Computer Science, Fudan University
Jiazheng Zhang, Fudan University (large language models, natural language processing, data mining)
Rui Zheng, School of Computer Science, Fudan University
Nijun Li, Cognitive AI Lab, Shanghai Huawei Technologies, China
Tao Gui, Institute of Modern Languages and Linguistics, Fudan University
Yun Li, Cognitive AI Lab, Shanghai Huawei Technologies, China
Qi Zhang, School of Computer Science, Fudan University
Xuanjing Huang, School of Computer Science, Fudan University