🤖 AI Summary
This work addresses the challenge of reliably training process reward models (PRMs) for scientific reasoning tasks such as biological inference, where expert-annotated fine-grained supervision signals are scarce and existing weak-to-strong generalization methods lack effective mechanisms for filtering noisy supervision. To this end, we propose the DC-W2S framework, which introduces a dual-consensus mechanism—combining self-consensus (SC) among weak supervisors with neighborhood consensus (NC) in the embedding space—to stratify noisy supervision signals by reliability. Through instance-level balanced sampling and label-level reliability-aware masking, DC-W2S enables curriculum learning without requiring extensive expert step-by-step annotation. Our approach yields robust PRMs that significantly improve the reliability and effectiveness of process reward modeling in complex biological reasoning tasks.
📝 Abstract
In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problem inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide training. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, showing that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
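To make the dual-consensus idea concrete, the stratification step could be sketched as follows. This is a minimal illustrative implementation, not the paper's actual method: it assumes binary step labels from multiple weak supervisors, a cosine k-NN neighborhood for NC, and example thresholds (`sc_thresh`, `nc_thresh`) and tier definitions that the abstract does not specify.

```python
import numpy as np

def stratify_labels(weak_votes, embeddings, k=5, sc_thresh=0.8, nc_thresh=0.8):
    """Stratify noisy step labels into reliability regimes via dual consensus.

    weak_votes: (n_steps, n_supervisors) binary labels from weak supervisors.
    embeddings: (n_steps, d) embeddings of the reasoning steps.
    Returns majority labels, SC and NC scores, and a tier per step
    (2 = high reliability, 1 = medium, 0 = low).
    """
    # Self-Consensus (SC): agreement rate among weak supervisors per step.
    vote_mean = weak_votes.mean(axis=1)
    maj = (vote_mean >= 0.5).astype(int)
    sc = np.maximum(vote_mean, 1 - vote_mean)

    # Neighborhood-Consensus (NC): fraction of the k nearest neighbors
    # (cosine similarity in embedding space) sharing the step's majority label.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)          # exclude self from neighbors
    nn = np.argsort(-sims, axis=1)[:, :k]
    nc = (maj[nn] == maj[:, None]).mean(axis=1)

    # Intersect the two signals: both high -> tier 2, one high -> tier 1.
    tiers = np.where((sc >= sc_thresh) & (nc >= nc_thresh), 2,
             np.where((sc >= sc_thresh) | (nc >= nc_thresh), 1, 0))
    return maj, sc, nc, tiers
```

Reliability-aware masking could then weight the PRM loss by tier, e.g. full weight for tier 2, a reduced weight for tier 1, and weight 0 (masked) for tier 0, with instance-level balanced sampling applied across tiers when forming the curriculum.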