DPRM: A Dual Implicit Process Reward Model in Multi-Hop Question Answering

πŸ“… 2025-11-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
In multi-hop question answering, existing process reward models (PRMs) struggle to jointly enforce knowledge graph (KG) structural constraints and assess consistency with chain-of-thought (CoT) reasoning paths, while relying either on costly human annotations or failing to model structured reasoning. To address this, we propose the Dual Implicit Process Reward Model (DPRM)β€”the first implicit PRM adapted to KG-based reasoning. DPRM jointly models CoT and KG inference paths, inferring step-wise rewards via outcome supervision without human annotation. It introduces a dual-reward coordination mechanism and explicit reasoning-path consistency constraints to enable mutual verification and joint optimization of CoT and KG reasoning. Furthermore, it integrates KG semantic matching with CoT logical coherence for fine-grained, multi-step reward allocation. Evaluated on multiple benchmarks, DPRM outperforms 13 baselines, achieving up to a 16.6% gain in Hit@1, significantly improving both reasoning accuracy and reliability.

πŸ“ Abstract
In multi-hop question answering (MHQA) tasks, Chain of Thought (CoT) improves the quality of generation by guiding large language models (LLMs) through multi-step reasoning, and Knowledge Graphs (KGs) reduce hallucinations via semantic matching. Outcome Reward Models (ORMs) provide feedback only after the final answer is generated and thus fail to evaluate the multi-step reasoning process. Traditional Process Reward Models (PRMs) evaluate the reasoning process but require costly human annotations or rollout generation. Implicit PRMs, in contrast, are trained only with outcome signals and derive step rewards through reward parameterization without explicit annotations, making them better suited to multi-step reasoning in MHQA tasks. However, existing implicit PRMs have only been explored in plain-text scenarios; when adapted to MHQA tasks, they can neither handle the graph-structure constraints in KGs nor capture the potential inconsistency between CoT and KG paths. To address these limitations, we propose the DPRM (Dual Implicit Process Reward Model), which trains two implicit PRMs for CoT and KG reasoning in MHQA tasks. Both PRMs, namely KG-PRM and CoT-PRM, derive step-level rewards from outcome signals via reward parameterization without additional explicit annotations. KG-PRM uses preference pairs to learn structural constraints from KGs. DPRM further introduces a consistency constraint between CoT and KG reasoning steps, enabling the two PRMs to mutually verify and collaboratively optimize the reasoning paths. We also provide a theoretical derivation of the process rewards. Experimental results show that our method outperforms 13 baselines on multiple datasets with up to a 16.6% improvement in Hit@1.
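To make the abstract's central mechanism concrete, here is a minimal sketch of how an implicit PRM derives step-level rewards from outcome-only training, assuming the common parameterization in which the reward of a prefix is the scaled log-likelihood ratio of the trained policy against a frozen reference model, so each reasoning step's reward is that ratio's increment over the step's tokens. The function name, `beta` value, and log-probabilities are illustrative, not taken from the paper.

```python
def step_rewards(policy_logps, ref_logps, beta=0.1):
    """Implicit step-level rewards from outcome-trained models (sketch).

    policy_logps / ref_logps: per-step lists of token log-probabilities
    under the outcome-trained policy and the frozen reference model.
    The reward of a prefix is beta * (log-likelihood ratio), so the
    reward of step t is the increment of that quantity over step t.
    """
    rewards = []
    for p_step, r_step in zip(policy_logps, ref_logps):
        # Increment of the scaled log-ratio contributed by this step.
        rewards.append(beta * (sum(p_step) - sum(r_step)))
    return rewards

# Toy example: two reasoning steps, three tokens each (hypothetical values).
policy = [[-0.2, -0.5, -0.3], [-1.0, -0.4, -0.6]]
ref    = [[-0.4, -0.6, -0.5], [-0.9, -0.5, -0.7]]
print(step_rewards(policy, ref, beta=0.1))
```

A step whose tokens become more likely under the outcome-trained policy than under the reference receives a positive reward; no per-step human labels are needed, matching the "reward parameterization without explicit annotations" claim.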
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning process in multi-hop QA without costly human annotations
Handling graph structure constraints in Knowledge Graphs for process evaluation
Addressing inconsistency between Chain of Thought and Knowledge Graph reasoning paths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual implicit reward models for reasoning processes
KG-PRM learns structural constraints from knowledge graphs
Consistency constraint aligns CoT and KG reasoning paths
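The paper does not spell out the form of its consistency constraint here, but one plausible reading of "mutual verification" between CoT-PRM and KG-PRM is a penalty on disagreement between the two models' rewards for aligned reasoning steps. The sketch below is an assumption-laden illustration of that idea, not the paper's actual loss.

```python
def consistency_penalty(cot_rewards, kg_rewards, lam=1.0):
    """Hypothetical CoT-KG consistency term (illustrative, not the paper's loss).

    Penalizes mean squared disagreement between the CoT-PRM and KG-PRM
    step rewards on step-aligned reasoning paths, weighted by lam.
    """
    assert len(cot_rewards) == len(kg_rewards), "paths must be step-aligned"
    sq = [(c - k) ** 2 for c, k in zip(cot_rewards, kg_rewards)]
    return lam * sum(sq) / len(sq)

# Toy example: step rewards from the two PRMs on the same question.
print(consistency_penalty([0.05, 0.01], [0.04, 0.03]))
```

Minimizing such a term pushes the two PRMs toward agreement on which steps are good, which is one way the abstract's "mutually verify and collaboratively optimize" behavior could be operationalized.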
Authors

- Xinyi Wang β€” National University of Defense Technology
- Yiping Song β€” Student at Peking University (natural language processing)
- Zhiliang Tian β€” National University of Defense Technology
- Bo Liu β€” Academy of Military Sciences
- Tingjin Luo β€” NUDT (Machine Learning, Computer Vision, Data Mining)
- Minlie Huang β€” The CoAI Group, Department of Computer Science and Technology, Tsinghua University